PREFACE TO THE
THIRD EDITION 


The purpose of the third edition of this book is to give a sound and self-contained
(in the sense that the necessary probability theory is included) introduction
to classical or mainstream statistical theory. It is not a statistical-methods
cookbook, nor a compendium of statistical theories, nor is it a mathematics
book. The book is intended to be a textbook, aimed for use in the traditional
full-year upper-division undergraduate course in probability and statistics, 
or for use as a text in a course designed for first-year graduate students. The 
latter course is often a “service course,” offered to a variety of disciplines. 

No previous course in probability or statistics is needed in order to study 
the book. The mathematical preparation required is the conventional full-year 
calculus course which includes series expansion, multiple integration, and par- 
tial differentiation. Linear algebra is not required. An attempt has been 
made to talk to the reader. Also, we have retained the approach of presenting 
the theory with some connection to practical problems. The book is not mathe- 
matically rigorous. Proofs, and even exact statements of results, are often not 
given. Instead, we have tried to impart a “feel” for the theory.

The book is designed to be used in either the quarter system or the semester
system. In a quarter system, Chaps. I through V could be covered in the first
quarter, Chaps. VI through part of VIII the second quarter, and the rest of the
book the third quarter. In a semester system, Chaps. I through VI could be
covered the first semester and the remaining chapters the second semester.
Chapter VI is a “bridging” chapter; it can be considered to be a part of
“probability” or a part of “statistics.” Several sections or subsections can be omitted
without disrupting the continuity of presentation. For example, any of the
following could be omitted: Subsec. 4.5 of Chap. II; Subsecs. 2.6, 3.5, 4.2, and
4.3 of Chap. III; Subsec. 5.3 of Chap. VI; Subsecs. 2.3, 3.4, 4.3 and Secs. 6
through 9 of Chap. VII; Secs. 5 and 6 of Chap. VIII; Secs. 6 and 7 of Chap. IX;
and all or part of Chaps. X and XI. Subsection 5.3 of Chap. VI on extreme-value
theory is somewhat more difficult than the rest of that chapter. In Chap. VII,
Subsec. 7.1 on Bayes estimation can be taught without Subsec. 3.4 on loss and
risk functions, but Subsec. 7.2 cannot. Parts of Sec. 8 of Chap. VII utilize matrix
notation. The many problems are intended to be essential for learning the
material in the book. Some of the more difficult problems have been starred.

ALEXANDER M. MOOD
FRANKLIN A. GRAYBILL
DUANE C. BOES 



EXCERPTS FROM THE FIRST 
AND SECOND EDITION PREFACES 


This book developed from a set of notes which I prepared in 1945. At that time 
there was no modern text available specifically designed for beginning students 
of mathematical statistics. Since then the situation has been relieved consider- 
ably, and had I known in advance what books were in the making it is likely 
that I should not have embarked on this volume. However, it seemed suffi- 
ciently different from other presentations to give prospective teachers and stu- 
dents a useful alternative choice. 

The aforementioned notes were used as text material for three years at Iowa
State College in a course offered to senior and first-year graduate students. 
The only prerequisite for the course was one year of calculus, and this require- 
ment indicates the level of the book. (The calculus class at Iowa State met four 
hours per week and included good coverage of Taylor series, partial differentia- 
tion, and multiple integration.) No previous knowledge of statistics is assumed. 

This is a statistics book, not a mathematics book, as any mathematician 
will readily see. Little mathematical rigor is to be found in the derivations 
simply because it would be boring and largely a waste of time at this level. Of 
course rigorous thinking is quite essential to good statistics, and I have been at 
some pains to make a show of rigor and to instill an appreciation for rigor by 
pointing out various pitfalls of loose arguments. 

While this text is primarily concerned with the theory of statistics, full
cognizance has been taken of those students who fear that a moment may be
wasted in mathematical frivolity. All new subjects are supplied with a little
scenery from practical affairs, and, more important, a serious effort has been
made in the problems to illustrate the variety of ways in which the theory may
be applied.

The problems are an essential part of the book. They range from simple
numerical examples to theorems needed in subsequent chapters. They include
important subjects which could easily take precedence over material in the text;
the relegation of subjects to problems was based rather on the feasibility of such
a procedure than on the priority of the subject. For example, the matter of
correlation is dealt with almost entirely in the problems. It seemed to me
inefficient to cover multivariate situations twice in detail, i.e., with the regression
model and with the correlation model. The emphasis in the text proper is on
the more general regression model.

The author of a textbook is indebted to practically everyone who has
touched the field, and I here bow to all statisticians. However, in giving credit
to contributors one must draw the line somewhere, and I have simplified matters
by drawing it very high; only the most eminent contributors are mentioned in
the book.

I am indebted to Catherine Thompson and Maxine Merrington, and to
E. S. Pearson, editor of Biometrika, for permission to include Tables III and V,
which are abridged versions of tables published in Biometrika. I am also
indebted to Professors R. A. Fisher and Frank Yates, and to Messrs. Oliver and
Boyd, Ltd., Edinburgh, for permission to reprint Table IV from their book
“Statistical Tables for Use in Biological, Agricultural and Medical Research.”

Since the first edition of this book was published in 1950 many new
statistical techniques have been made available and many techniques that were only in
the domain of the mathematical statistician are now useful and demanded by
the applied statistician. To include some of this material we have had to
eliminate other material, else the book would have come to resemble a compendium.
The general approach of presenting the theory with some connection to
practical problems apparently contributed significantly to the success of the first
edition, and we have tried to maintain that feature in the present edition.


PROBABILITY


1 INTRODUCTION AND SUMMARY 

The purpose of this chapter is to define probability and discuss some of its prop- 
erties. Section 2 is a brief essay on some of the different meanings that have 
been attached to probability and may be omitted by those who are interested 
only in mathematical (axiomatic) probability, which is defined in Sec. 3 and
used throughout the remainder of the text. Section 3 is subdivided into six 
subsections. The first, Subsec. 3.1, discusses the concept of probability models. 
It provides a real-world setting for the eventual mathematical definition of 
probability. A review of some of the set theoretical concepts that are relevant 
to probability is given in Subsec. 3.2. Sample space and event space are 
defined in Subsec. 3.3. Subsection 3.4 commences with a recall of the definition 
of a function. Such a definition is useful since many of the words to be defined 
in this and coming chapters (e.g., probability, random variable, distribution, 
etc.) are defined as particular functions. The indicator function, to be used 
extensively in later chapters, is defined here. The probability axioms are
presented, and the probability function is defined. Several properties of this
probability function are stated. The culmination of this subsection is the definition
of a probability space. Subsection 3.5 is devoted to examples of probabilities 
defined on finite sample spaces. The related concepts of independence of
events and conditional probability are discussed in the sixth and final subsection.
Bayes’ theorem, the multiplication rule, and the theorem of total probabilities
are proved or derived, and examples of each are given.

Of the three main sections included in this chapter, only Sec. 3, which is
by far the longest, is vital. The definitions of probability, probability space,
conditional probability, and independence, along with familiarity with the
properties of probability, conditional and unconditional, and related formulas,
are the essence of this chapter. This chapter is a background chapter; it
introduces the language of probability to be used in developing distribution theory,
which is the backbone of the theory of statistics.


2 KINDS OF PROBABILITY 

2.1 Introduction

One of the fundamental tools of statistics is probability, which had its formal
beginnings with games of chance in the seventeenth century.

Games of chance, as the name implies, include such actions as spinning a
roulette wheel, throwing dice, tossing a coin, drawing a card, etc., in which the
outcome of a trial is uncertain. However, it is recognized that even though the
outcome of any particular trial may be uncertain, there is a predictable
long-term outcome. It is known, for example, that in many throws of an ideal
(balanced, symmetrical) coin about one-half of the trials will result in heads.
It is this long-term, predictable regularity that enables gaming houses to engage
in the business.

A similar type of uncertainty and long-term regularity often occurs in
experimental science. For example, in the science of genetics it is uncertain
whether an offspring will be male or female, but in the long run it is known
approximately what percent of offspring will be male and what percent will be
female. A life insurance company cannot predict which persons in the United
States will die at age 50, but it can predict quite satisfactorily how many people
in the United States will die at that age.

First we shall discuss the classical, or a priori, theory of probability; then
we shall discuss the frequency theory. Development of the axiomatic approach
will be deferred until Sec. 3.


2.2 Classical or A Priori Probability

As we stated in the previous subsection, the theory of probability in its early 
stages was closely associated with games of chance. This association prompted 
the classical definition. For example, suppose that we want the probability of 
the event that an ideal coin will turn up heads. We argue in this manner: Since 
there are only two ways that the coin can fall, heads or tails, and since the coin
is well balanced, one would expect that the coin is just as likely to fall heads as
tails; hence, the probability of the event of a head will be given the value 1/2.
This kind of reasoning prompted the following classical definition of
probability.

Definition 1 Classical probability If a random experiment can result
in n mutually exclusive and equally likely outcomes and if $n_A$ of these
outcomes have an attribute A, then the probability of A is the fraction
$n_A/n$. ////

We shall apply this definition to a few examples in order to illustrate its meaning. 

If an ordinary die (one of a pair of dice) is tossed, there are six possible
outcomes: any one of the six numbered faces may turn up. These six outcomes
are mutually exclusive since two or more faces cannot turn up simultaneously.
And if the die is fair, or true, the six outcomes are equally likely; i.e., it is expected
that each face will appear with about equal relative frequency in the long run.
Now suppose that we want the probability that the result of a toss be an even
number. Three of the six possible outcomes have this attribute. The
probability that an even number will appear when a die is tossed is therefore 3/6, or 1/2.
Similarly, the probability that a 5 will appear when a die is tossed is 1/6. The
probability that the result of a toss will be greater than 2 is 4/6, or 2/3.
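
The counting in Definition 1 can be sketched in a few lines of Python. This is only an illustration under the assumption of an ideal six-faced die; the helper name `classical_probability` is hypothetical and is not notation used in the text.

```python
from fractions import Fraction

def classical_probability(outcomes, has_attribute):
    """Return n_A / n for equally likely, mutually exclusive outcomes."""
    outcomes = list(outcomes)
    n = len(outcomes)
    n_a = sum(1 for outcome in outcomes if has_attribute(outcome))
    return Fraction(n_a, n)

die = range(1, 7)  # the six faces of an ideal die (assumed fair)

p_even = classical_probability(die, lambda face: face % 2 == 0)
p_five = classical_probability(die, lambda face: face == 5)
p_gt_two = classical_probability(die, lambda face: face > 2)

print(p_even, p_five, p_gt_two)  # 1/2 1/6 2/3
```

Exact fractions are used so that the results can be compared directly with the values obtained by hand.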

To consider another example, suppose that a card is drawn at random from
an ordinary deck of playing cards. The probability of drawing a spade is
readily seen to be 13/52, or 1/4. The probability of drawing a number between 5
and 10, inclusive, is 24/52, or 6/13.

The application of the definition is straightforward enough in these simple 
cases, but it is not always so obvious. Careful attention must be paid to the 
qualifications “mutually exclusive,” “equally likely,” and “random.” Suppose
that one wishes to compute the probability of getting two heads if a coin is 
tossed twice. He might reason that there are three possible outcomes for the 
two tosses: two heads, two tails, or one head and one tail. One of these three 
outcomes has the desired attribute, i.e., two heads; therefore the probability is
1/3. This reasoning is faulty because the three given outcomes are not equally
likely. The third outcome, one head and one tail, can occur in two ways
since the head may appear on the first toss and the tail on the second or the
head may appear on the second toss and the tail on the first. Thus there are
four equally likely outcomes: HH, HT, TH, and TT. The first of these has
the desired attribute, while the others do not. The correct probability is
therefore 1/4. The result would be the same if two ideal coins were tossed
simultaneously.
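
The enumeration behind this correction can be sketched as follows. The code assumes an ideal coin, lists the four ordered outcomes, and then shows how the faulty three-outcome description lumps two of them together; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# The four equally likely ordered outcomes of two tosses.
outcomes = ["".join(pair) for pair in product("HT", repeat=2)]

p_two_heads = Fraction(outcomes.count("HH"), len(outcomes))

# The faulty description ignores order, so HT and TH are merged.
lumped = {}
for outcome in outcomes:
    key = "".join(sorted(outcome))  # "HT" and "TH" both become "HT"
    lumped[key] = lumped.get(key, 0) + 1

print(outcomes)     # ['HH', 'HT', 'TH', 'TT']
print(p_two_heads)  # 1/4
print(lumped)       # {'HH': 1, 'HT': 2, 'TT': 1}
```

The count of 2 attached to "one head and one tail" is what makes the three lumped outcomes unequally likely.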

Again, suppose that one wished to compute the probability that a card
drawn from an ordinary well-shuffled deck will be an ace or a spade. In
enumerating the favorable outcomes, one might count 4 aces and 13 spades and
reason that there are 17 outcomes with the desired attribute. This is clearly
incorrect because these 17 outcomes are not mutually exclusive since the ace of
spades is both an ace and a spade. There are 16 outcomes that are favorable to
an ace or a spade, and so the correct probability is 16/52, or 4/13.
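
The double counting of the ace of spades can be made explicit with a small sketch. The deck encoding below (rank and suit labels as strings) is an assumption made for illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(ranks, suits))  # 52 equally likely cards

aces = {card for card in deck if card[0] == "A"}
spades = {card for card in deck if card[1] == "spades"}

naive_count = len(aces) + len(spades)  # counts the ace of spades twice
favorable = aces | spades              # the union counts it once

print(naive_count, len(favorable))          # 17 16
print(Fraction(len(favorable), len(deck)))  # 4/13
```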

We note that by the classical definition the probability of event A is a
number between 0 and 1 inclusive. The ratio $n_A/n$ must be less than or equal to
1 since the total number of possible outcomes cannot be smaller than the
number of outcomes with a specified attribute. If an event is certain to happen,
its probability is 1; if it is certain not to happen, its probability is 0. Thus, the
probability of obtaining an 8 in tossing a die is 0. The probability that the
number showing when a die is tossed is less than 10 is equal to 1.

The probabilities determined by the classical definition are called a priori
probabilities. When one states that the probability of obtaining a head in
tossing a coin is 1/2, he has arrived at this result purely by deductive reasoning.
The result does not require that any coin be tossed or even be at hand. We say
that if the coin is true, the probability of a head is 1/2, but this is little more than
saying the same thing in two different ways. Nothing is said about how one
can determine whether or not a particular coin is true.

The fact that we shall deal with ideal objects in developing a theory of
probability will not trouble us because that is a common requirement of
mathematical systems. Geometry, for example, deals with conceptually perfect
circles, lines with zero width, and so forth, but it is a useful branch of
knowledge, which can be applied to diverse practical problems.

There are some rather troublesome limitations in the classical, or a priori,
approach. It is obvious, for example, that the definition of probability must
be modified somehow when the total number of possible outcomes is infinite.
One might seek, for example, the probability that an integer drawn at random
from the positive integers be even. The intuitive answer to this question is 1/2.
If one were pressed to justify this result on the basis of the definition, he might
reason as follows: Suppose that we limit ourselves to the first 20 integers; 10
of these are even so that the ratio of favorable outcomes to the total number is
10/20, or 1/2. Again, if the first 200 integers are considered, 100 of these are even,
and the ratio is also 1/2. In general, the first 2N integers contain N even integers;
if we form the ratio N/2N and let N become infinite so as to encompass the whole
set of positive integers, the ratio remains 1/2. The above argument is plausible,
and the answer is plausible, but it is no simple matter to make the argument
stand up. It depends, for example, on the natural ordering of the positive
integers, and a different ordering could produce a different result. Thus, one
could just as well order the integers in this way: 1, 3, 2; 5, 7, 4; 9, 11, 6; ...,
taking the first pair of odd integers then the first even integer, the second pair
of odd integers then the second even integer, and so forth. With this ordering,
one could argue that the probability of drawing an even integer is 1/3. The
integers can also be ordered so that the ratio will oscillate and never approach
any definite value as N increases.
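
The dependence on ordering can be checked numerically. The sketch below is an illustration, not a justification of either answer: it computes the fraction of even integers among the first terms of the natural ordering and of the ordering 1, 3, 2; 5, 7, 4; ... described above. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import count, islice

def natural_order():
    """1, 2, 3, 4, ..."""
    return count(1)

def two_odds_then_one_even():
    """1, 3, 2; 5, 7, 4; 9, 11, 6; ..."""
    odds, evens = count(1, 2), count(2, 2)
    while True:
        yield next(odds)
        yield next(odds)
        yield next(evens)

def fraction_even(ordering, how_many):
    """Fraction of even integers among the first how_many terms."""
    terms = list(islice(ordering, how_many))
    return Fraction(sum(1 for t in terms if t % 2 == 0), len(terms))

print(fraction_even(natural_order(), 3000))           # 1/2
print(fraction_even(two_odds_then_one_even(), 3000))  # 1/3
```

Both orderings eventually list every positive integer exactly once, yet the limiting ratios differ, which is the difficulty the passage points out.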

There is another difficulty with the classical approach to the theory of 
probability which is deeper even than that arising in the case of an infinite 
number of outcomes. Suppose that we toss a coin known to be biased in 
favor of heads (it is bent so that a head is more likely to appear than a tail). 
The two possible outcomes of tossing the coin are not equally likely. What is 
the probability of a head? The classical definition leaves us completely helpless
here. 

Still another difficulty with the classical approach is encountered when we 
try to answer questions such as the following: What is the probability that a 
child born in Chicago will be a boy? Or what is the probability that a male
will die before age 50? Or what is the probability that a cookie bought at a
certain bakery will have less than three peanuts in it? All these are legitimate
questions which we want to bring into the realm of probability theory. However,
notions of “symmetry,” “equally likely,” etc., cannot be utilized as they could
be in games of chance. Thus we shall have to alter or extend our definition to 
bring problems similar to the above into the framework of the theory. This 
more widely applicable probability is called a posteriori probability, or frequency, 
and will be discussed in the next subsection. 


2.3 A Posteriori or Frequency Probability 

A coin which seemed to be well balanced and symmetrical was tossed 100 times,
and the outcomes recorded in Table 1. The important thing to notice is that the
relative frequency of heads is close to 1/2. This is not unexpected since the coin
was symmetrical, and it was anticipated that in the long run heads would occur
about one-half of the time. For another example, a single die was thrown 300
times, and the outcomes recorded in Table 2. Notice how close the relative
frequency of a face with a 1 showing is to 1/6; similarly for a 2, 3, 4, 5, and 6.
These results are not unexpected since the die which was used was quite
symmetrical and balanced; it was expected that each face would occur with about
equal frequency in the long run. This suggests that we might be willing to use
this relative frequency in Table 1 as an approximation for the probability that
the particular coin used will come up heads, or we might be willing to use the
relative frequencies in Table 2 as approximations for the probabilities that
various numbers on this die will appear. Note that although the relative
frequencies of the different outcomes are predictable, the actual outcome of an
individual throw is unpredictable.
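
A comparable experiment can be imitated on a computer. The sketch below uses Python's pseudorandom generator as a stand-in for an ideal coin and die, which is an assumption; its counts will not reproduce Tables 1 and 2, which record physical tosses. The function name and the fixed seed are illustrative choices.

```python
import random
from collections import Counter

def relative_frequencies(outcomes, n_trials, seed=0):
    """Simulate n_trials equally likely draws and return relative frequencies."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    counts = Counter(rng.choice(outcomes) for _ in range(n_trials))
    return {outcome: counts[outcome] / n_trials for outcome in outcomes}

coin = relative_frequencies(["H", "T"], 100)
die = relative_frequencies([1, 2, 3, 4, 5, 6], 300)

print(coin)
for face, freq in die.items():
    print(face, round(freq, 3))
```

Increasing `n_trials` typically brings the relative frequencies closer to 1/2 and 1/6, while any single simulated toss remains unpredictable without knowledge of the seed.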

In fact, it seems reasonable to assume for the coin experiment that there
exists a number, label it p, which is the probability of a head. Now if the coin
appears well balanced, symmetrical, and true, we might use Definition 1 and
state that p is approximately equal to 1/2. It is only an approximation to set p
equal to 1/2 since for this particular coin we cannot be certain that the two cases,
heads and tails, are exactly equally likely. But by examining the balance and
symmetry of the coin it may seem quite reasonable to assume that they are.
Alternatively, the coin could be tossed a large number of times, the results
recorded as in Table 1, and the relative frequency of a head used as an
approximation for p. In the experiment with a die, the probability $p_2$ of a 2 showing
could be approximated by using Definition 1 or by using the relative frequency
in Table 2. The important thing is that we postulate that there is a number p
which is defined as the probability of a head with the coin or a number $p_2$
which is the probability of a 2 showing in the throw of the die. Whether we use
Definition 1 or the relative frequency for the probability seems unimportant in
the examples cited.


Table 1 RESULTS OF TOSSING A COIN 100 TIMES

                                          Long-run expected
          Observed    Observed relative   relative frequency
Outcome   frequency   frequency           of a balanced coin

H             56            .56                  .50
T             44            .44                  .50

Total        100           1.00                 1.00






Suppose, as described above, that the coin is unbalanced so that we are 
quite certain from an examination that the two cases, heads and tails, are not 
equally likely to happen. In these cases a number p can still be postulated 
as the probability that a head shows, but the classical definition will not help us 
to find the value of p. We must use the frequency approach or possibly some
physical analysis of the unbalanced coin. 

In many scientific investigations, observations are taken which have an ele- 
ment of uncertainty or unpredictability in them. As a very simple example, sup- 
pose that we want to predict whether the next baby born in a certain locality will 
be a male or a female. This is individually an uncertain event, but the results of 
groups of births can be dealt with satisfactorily. We find that a certain long- 
run regularity exists which is similar to the long-run regularity of the frequency 
ratio of a head when a coin is thrown. If, for example, we find upon examination 
of records that about 51 percent of the births are male, it might be reasonable to 
postulate that the probability of a male birth in this locality is equal to a number 
p and take .51 as its approximation. 

Table 2 RESULTS OF TOSSING A DIE 300 TIMES

                                          Long-run expected
          Observed    Observed relative   relative frequency
Outcome   frequency   frequency           of a balanced die

1             51           .170                 .1667
2             54           .180                 .1667
3             48           .160                 .1667
4             51           .170                 .1667
5             49           .163                 .1667
6             47           .157                 .1667

Total        300          1.000                1.000

To make this idea more concrete, we shall assume that a series of
observations (or experiments) can be made under quite uniform conditions. That is,
an observation of a random experiment is made; then the experiment is repeated
under similar conditions, and another observation taken. This is repeated
many times, and while the conditions are similar each time, there is an
uncontrollable variation which is haphazard or random so that the observations are
individually unpredictable. In many of these cases the observations fall into
certain classes wherein the relative frequencies are quite stable. This suggests
that we postulate a number p, called the probability of the event, and approximate
p by the relative frequency with which the repeated observations satisfy the
event. For instance, suppose that the experiment consists of sampling the
population of a large city to see how many voters favor a certain proposal.
The outcomes are “favor” or “do not favor,” and each voter’s response is
unpredictable, but it is reasonable to postulate a number p as the probability that
a given response will be “favor.” The relative frequency of “favor” responses
can be used as an approximate value for p.

As another example, suppose that the experiment consists of sampling
transistors from a large collection of transistors. We shall postulate that the
probability of a given transistor being defective is p. We can approximate p by
selecting several transistors at random from the collection and computing the
relative frequency of the number defective.

The important thing is that we can conceive of a series of observations or
experiments under rather uniform conditions. Then a number p can be
postulated as the probability of the event A happening, and p can be approximated by
the relative frequency of the event A in a series of experiments.


3 PROBABILITY-AXIOMATIC

3.1 Probability Models

One of the aims of science is to predict and describe events in the world in which
we live. One way in which this is done is to construct mathematical models
which adequately describe the real world. For example, the equation $s = \frac{1}{2}gt^2$
expresses a certain relationship between the symbols s, g, and t. It is a
mathematical model. To use the equation $s = \frac{1}{2}gt^2$ to predict s, the distance a body
falls, as a function of time t, the gravitational constant g must be known. The
latter is a physical constant which must be measured by experimentation if the
equation $s = \frac{1}{2}gt^2$ is to be useful. The reason for mentioning this equation is
that we do a similar thing in probability theory; we construct a probability
model which can be used to describe events in the real world. For example, it
might be desirable to find an equation which could be used to predict the sex of
each birth in a certain locality. Such an equation would be very complex, and
none has been found. However, a probability model can be constructed which,
while not very helpful in dealing with an individual birth, is quite useful in
dealing with groups of births. Therefore, we can postulate a number p which
represents the probability that a birth will be a male. From this fundamental
probability we can answer questions such as: What is the probability that in
ten births at least three will be males? Or what is the probability that there will
be three consecutive male births in the next five? To answer questions such as
these and many similar ones, we shall develop an idealized probability model.
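
As a preview of how such questions might be answered once a model is in place, the sketch below makes two assumptions that the text has not yet introduced: births are independent, and each is male with the same probability p. It also reads "three consecutive male births in the next five" as "at least one run of three or more males." The value p = 0.51 and the function names are illustrative only.

```python
from itertools import product
from math import comb

def prob_at_least(k, n, p):
    """P(at least k males in n independent births), each male with probability p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

def prob_run_of_males(run, n, p):
    """P(a run of `run` consecutive males among n births), by enumerating sequences."""
    total = 0.0
    for seq in product("MF", repeat=n):
        if "M" * run in "".join(seq):
            males = seq.count("M")
            total += p**males * (1 - p) ** (n - males)
    return total

p = 0.51  # illustrative value only
print(round(prob_at_least(3, 10, p), 4))
print(round(prob_run_of_males(3, 5, p), 4))
```

The first function sums binomial terms; the second simply lists all 32 possible sequences of five births, which is feasible only because the number of births is small.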

The two general types of probability (a priori and a posteriori) defined
above have one important thing in common: They both require a conceptual
experiment in which the various outcomes can occur under somewhat uniform
conditions. For example, repeated tossing of a coin for the a priori case, and
repeated birth for the a posteriori case. However, we might like to bring into 
the realm of probability theory situations which cannot conceivably fit into the 
framework of repeated outcomes under somewhat similar conditions. For 
example, we might like to answer questions such as: What is the probability my 
wife loves me? Or what is the probability that World War III will start before 
January 1, 1985? These types of problems are certainly a legitimate part of 
general probability theory and are included in what is referred to as subjective 
probability. We shall not discuss subjective probability to any great extent in 
this book, but we remark that the axioms of probability from which we develop 
probability theory are rich enough to include a priori probability, a posteriori 
probability, and subjective probability. 

To start, we require that every possible outcome of the experiment under 
study can be enumerated. For example, in the coin-tossing experiment there are 
two possible outcomes: heads and tails. We shall associate probabilities only 
with these outcomes or with collections of these outcomes. We add, however, 
that even if a particular outcome is impossible, it can be included (its probability 
is 0). The main thing to remember is that every outcome which can occur 
must be included. 

Each conceivable outcome of the conceptual experiment under study will be
defined as a sample point, and the totality of conceivable outcomes (or sample
points) will be defined as the sample space.

Our object, of course, is to assess the probability of certain outcomes or 
collections of outcomes of the experiment. Discussion of such probabilities 
is conveniently couched in the language of set theory, an outline of which 
appears in the next subsection. We shall return to formal definitions and 
examples of sample space, event, and probability. 


3.2 An Aside — Set Theory 

We begin with a collection of objects. Each object in our collection will be
called a point or element. We assume that our collection of objects is large
enough to include all the points under consideration in a given discussion.
The totality of all these points is called the space, universe, or universal set.
We will call it the space (anticipating that it will become the sample space when
we speak of probability) and denote it by $\Omega$. Let $\omega$ denote an element or point
in $\Omega$. Although a set can be defined as any collection of objects, we shall
assume, unless otherwise stated, that all the sets mentioned in a given discussion
consist of points in the space $\Omega$.

EXAMPLE 1 $\Omega = R_2$, where $R_2$ is the collection of points $\omega$ in the plane and
$\omega = (x, y)$ is any pair of real numbers x and y. ////

EXAMPLE 2 $\Omega$ = {all United States citizens}. ////

We shall usually use capital Latin letters from the beginning of the
alphabet, with or without subscripts, to denote sets. If $\omega$ is a point or element
belonging to the set A, we shall write $\omega \in A$; if $\omega$ is not an element of A, we
shall write $\omega \notin A$.

Definition 2 Subset If every element of a set A is also an element of a
set B, then A is defined to be a subset of B, and we shall write $A \subset B$ or
$B \supset A$, read “A is contained in B” or “B contains A.” ////

Definition 3 Equivalent sets Two sets A and B are defined to be equivalent,
or equal, if $A \subset B$ and $B \subset A$. This will be indicated by writing
$A = B$. ////

Definition 4 Empty set If a set A contains no points, it will be called
the null set, or empty set, and denoted by $\phi$. ////

Definition 5 Complement The complement of a set A with respect to
the space $\Omega$, denoted by $\overline{A}$, $A^c$, or $\Omega - A$, is the set of all points that are in
$\Omega$ but not in A. ////

Definition 6 Union Let A and B be any two subsets of $\Omega$; then the
set that consists of all points that are in A or B or both is defined to be
the union of A and B and written $A \cup B$. ////

Definition 7 Intersection Let A and B be any two subsets of $\Omega$; then
the set that consists of all points that are in both A and B is defined to be
the intersection of A and B and is written $A \cap B$ or $AB$. ////

Definition 8 Set difference Let A and B be any two subsets of $\Omega$. The
set of all points in A that are not in B will be denoted by $A - B$ and is
defined as set difference. ////
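
For finite sets, Definitions 2 to 8 correspond closely to the built-in set operations of Python. The sketch below uses a small ten-point space as a stand-in for the space of the definitions; the particular sets are chosen only for illustration.

```python
# A small finite space standing in for the space Omega (an assumption).
omega = set(range(1, 11))
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

print(A <= omega)           # subset test (Definition 2): True
print({1, 2} == {2, 1})     # equal sets contain each other (Definition 3): True
print(A & {7, 8} == set())  # an empty intersection is the null set (Definition 4): True
print(omega - A)            # complement of A with respect to omega (Definition 5)
print(A | B)                # union (Definition 6)
print(A & B)                # intersection (Definition 7)
print(A - B)                # set difference (Definition 8)
```

Python has no separate complement operator, so the complement is written as a difference from the space, in line with the third notation of Definition 5.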

EXAMPLE 3 Let $\Omega = \{(x, y): 0 \le x \le 1 \text{ and } 0 \le y \le 1\}$, which is read the
collection of all points (x, y) for which $0 \le x \le 1$ and $0 \le y \le 1$. Define
the following sets:

$A_1 = \{(x, y): 0 \le x \le 1;\ 0 \le y \le \frac{1}{2}\}$,
$A_2 = \{(x, y): 0 \le x \le \frac{1}{2};\ 0 \le y \le 1\}$,
$A_3 = \{(x, y): 0 \le x \le y \le 1\}$,
$A_4 = \{(x, y): 0 \le x \le \frac{1}{2};\ 0 \le y \le \frac{1}{2}\}$.

(We shall adhere to the practice initiated here of using braces to embrace
the points of a set.)

The set relations below follow.

$A_4 \subset A_2$; $A_1 \cap A_2 = A_1 A_2 = A_4$;
$A_2 \cup A_3 = A_4 \cup A_3$; $\overline{A}_1 = \{(x, y): 0 \le x \le 1;\ \frac{1}{2} < y \le 1\}$;
$A_1 - A_4 = \{(x, y): \frac{1}{2} < x \le 1;\ 0 \le y \le \frac{1}{2}\}$. ////

EXAMPLE 4 Let $\Omega$, $A_1$, $A_2$, and $A_3$ be as indicated in the diagrams in Fig. 1,
which are called Venn diagrams. ////


The set operations of complement, union, and intersection have been 
defined in Definitions 5 to 7, respectively. These set operations satisfy quite a 
number of laws, some of which follow, stated as theorems. Proofs are omitted. 

Theorem 1 Commutative laws $A \cup B = B \cup A$, and $A \cap B = B \cap A$. ////

Theorem 2 Associative laws $A \cup (B \cup C) = (A \cup B) \cup C$, and
$A \cap (B \cap C) = (A \cap B) \cap C$. ////

Theorem 3 Distributive laws $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$, and
$A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$. ////

Theorem 4 $(A^c)^c = \bar{\bar{A}} = A$; in words, the complement of A complement
equals A. ////

Theorem 5 $A\Omega = A$; $A \cup \Omega = \Omega$; $A\phi = \phi$; and $A \cup \phi = A$. ////

Theorem 6 $A\bar{A} = \phi$; $A \cup \bar{A} = \Omega$; $A \cap A = A$; and $A \cup A = A$. ////

Theorem 7 $\overline{A \cup B} = \bar{A} \cap \bar{B}$, and $\overline{A \cap B} = \bar{A} \cup \bar{B}$. These are known
as De Morgan's laws. ////

Theorem 8 $A - B = A\bar{B}$. ////

Several of the above laws are illustrated in the Venn diagrams in Fig. 1.
Although we will feel free to use any of the above laws, it might be instructive
to give a proof of one of them just to illustrate the technique. For example,
let us show that $\overline{A \cup B} = \bar{A} \cap \bar{B}$. By definition, two sets are equal if each is
contained in the other. We first show that $\overline{A \cup B} \subset \bar{A} \cap \bar{B}$ by proving that if
$\omega \in \overline{A \cup B}$, then $\omega \in \bar{A} \cap \bar{B}$. Now $\omega \in \overline{A \cup B}$ implies $\omega \notin A \cup B$, which implies
that $\omega \notin A$ and $\omega \notin B$, which in turn implies that $\omega \in \bar{A}$ and $\omega \in \bar{B}$; that is,
$\omega \in \bar{A} \cap \bar{B}$. We next show that $\bar{A} \cap \bar{B} \subset \overline{A \cup B}$. Let $\omega \in \bar{A} \cap \bar{B}$, which means
$\omega$ belongs to both $\bar{A}$ and $\bar{B}$. Then $\omega \notin A \cup B$, for if it did, $\omega$ must belong to at
least one of A or B, contradicting that $\omega$ belongs to both $\bar{A}$ and $\bar{B}$; however,
$\omega \notin A \cup B$ means $\omega \in \overline{A \cup B}$, completing the proof.
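
The laws above can also be checked mechanically on a small finite space. The
following is only a sketch: the space `omega` and the sets `A` and `B` are
illustrative choices of ours, and a check on one example is of course not a
proof.

```python
# Sketch: check De Morgan's laws (Theorem 7) and Theorem 8 on one finite example.
# The space `omega` and the sets `A`, `B` are illustrative assumptions.
omega = set(range(1, 11))
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

def complement(s, space=omega):
    """Complement of s with respect to the space (Definition 5)."""
    return space - s

# complement of a union equals the intersection of the complements
law_union = complement(A | B) == complement(A) & complement(B)
# complement of an intersection equals the union of the complements
law_intersection = complement(A & B) == complement(A) | complement(B)
# set difference A - B equals A intersected with the complement of B
difference_law = (A - B) == A & complement(B)

print(law_union, law_intersection, difference_law)
```

Python's built-in `set` type supplies union (`|`), intersection (`&`), and
difference (`-`) directly; only the complement needs the space to be named.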

We defined union and intersection of two sets; these definitions extend 
immediately to more than two sets, in fact to an arbitrary number of sets. It 
is customary to distinguish between the sets in a collection of subsets of $\Omega$ by
assigning names to them in the form of subscripts. Let $\Lambda$ (Greek letter capital
lambda) denote a catalog of names, or indices. $\Lambda$ is also called an index set.
For example, if we are concerned with only two sets, then our index set $\Lambda$
includes only two indices, say 1 and 2; so $\Lambda = \{1, 2\}$.

Definition 9 Union and intersection of sets Let $\Lambda$ be an index set and
$\{A_\lambda: \lambda \in \Lambda\} = \{A_\lambda\}$, a collection of subsets of $\Omega$ indexed by $\Lambda$. The set
of points that consists of all points that belong to $A_\lambda$ for at least one $\lambda$ is
called the union of the sets $\{A_\lambda\}$ and is denoted by $\bigcup_{\lambda \in \Lambda} A_\lambda$. The set of points
that consists of all points that belong to $A_\lambda$ for every $\lambda$ is called the
intersection of the sets $\{A_\lambda\}$ and is denoted by $\bigcap_{\lambda \in \Lambda} A_\lambda$. If $\Lambda$ is empty, then define
$\bigcup_{\lambda \in \Lambda} A_\lambda = \phi$ and $\bigcap_{\lambda \in \Lambda} A_\lambda = \Omega$. ////


EXAMPLE 5 If $\Lambda = \{1, 2, \ldots, N\}$, i.e., $\Lambda$ is the index set consisting of the
first N integers, then $\bigcup_{\lambda \in \Lambda} A_\lambda$ is also written as

    $\bigcup_{n=1}^{N} A_n = A_1 \cup A_2 \cup \cdots \cup A_N$. ////

One of the most fundamental theorems relating unions, intersections, and
complements for an arbitrary collection of sets is due to De Morgan.

Theorem 9 De Morgan’s theorem Let $\Lambda$ be an index set and $\{A_\lambda\}$ a
collection of subsets of $\Omega$ indexed by $\Lambda$. Then,

    (i) $\overline{\bigcup_{\lambda \in \Lambda} A_\lambda} = \bigcap_{\lambda \in \Lambda} \bar{A}_\lambda$.

    (ii) $\overline{\bigcap_{\lambda \in \Lambda} A_\lambda} = \bigcup_{\lambda \in \Lambda} \bar{A}_\lambda$. ////

We will not give a proof of this theorem. Note, however, that the special
case when the index set $\Lambda$ consists of only two names or indices is Theorem 7
above, and a proof of part of Theorem 7 was given in the paragraph after
Theorem 8.
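
Definition 9 and De Morgan’s theorem can be illustrated for a finite indexed
collection. In the sketch below the space `omega`, the index names "a", "b",
"c", and the helper functions are hypothetical choices made only for
illustration; the empty-family conventions of Definition 9 appear as the
starting values of the two reductions.

```python
from functools import reduce

omega = frozenset(range(8))
# Hypothetical indexed collection {A_lambda}: index name -> subset of omega.
family = {
    "a": frozenset({0, 1, 2}),
    "b": frozenset({2, 3}),
    "c": frozenset({2, 5, 6}),
}

def union_all(sets):
    """Union over the index set; an empty family gives the empty set."""
    return reduce(frozenset.union, sets, frozenset())

def intersection_all(sets, space=omega):
    """Intersection over the index set; an empty family gives the whole space."""
    return reduce(frozenset.intersection, sets, space)

def comp(s, space=omega):
    """Complement with respect to the space."""
    return space - s

members = list(family.values())
# (i) complement of the union equals the intersection of the complements
part_i = comp(union_all(members)) == intersection_all([comp(s) for s in members])
# (ii) complement of the intersection equals the union of the complements
part_ii = comp(intersection_all(members)) == union_all([comp(s) for s in members])
empty_case = (union_all([]), intersection_all([]))

print(part_i, part_ii, empty_case)
```

With the empty-family conventions in place, both parts of the theorem also
hold when the index set is empty, which is one reason for adopting them.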

Definition 10 Disjoint or mutually exclusive Subsets A and B of $\Omega$ are
defined to be mutually exclusive or disjoint if $A \cap B = \phi$. Subsets
$A_1, A_2, \ldots$ are defined to be mutually exclusive if $A_iA_j = \phi$ for every $i \ne j$. ////

Theorem 10 If A and B are subsets of $\Omega$, then (i) $A = AB \cup A\bar{B}$, and
(ii) $AB \cap A\bar{B} = \phi$.

PROOF (i) $A = A \cap \Omega = A \cap (B \cup \bar{B}) = AB \cup A\bar{B}$. (ii) $AB \cap A\bar{B}
= AAB\bar{B} = A\phi = \phi$. ////

Theorem 11 If $A \subset B$, then $AB = A$, and $A \cup B = B$.

PROOF Left as an exercise. ////

3.3 Definitions of Sample Space and Event

In Subsec. 3.1 we described what might be meant by a probability model.
There we said that we had in mind some conceptual experiment whose possible
outcomes we would like to study by assessing the probability of certain outcomes
or collection of outcomes. In this subsection we will give two important
definitions, along with some examples, that will be used in assessing these
probabilities.

Definition 11 Sample space The sample space, denoted by $\Omega$, is the
collection or totality of all possible outcomes of a conceptual experiment. ////

One might try to understand the definition by looking at the individual
words. Use of the word “space” can be justified since the sample space is the 
total collection of objects or elements which are the outcomes of the experiment. 
This is in keeping with our use of the word “space” in set theory as the collec- 
tion of all objects of interest in a given discussion. The word “sample” is 
harder to justify; our experiment is random, meaning that its outcome is un- 
certain so that a given outcome is just one sample of many possible outcomes. 

Some other symbols that are used in other texts to denote a sample space,
in addition to $\Omega$, are S, Z, R, E, £, and A.

Definition 12 Event and event space An event is a subset of the sample 
space. The class of all events associated with a given experiment is 
defined to be the event space. //// 

The above does not precisely define what an event is. An event will 
always be a subset of the sample space, but for sufficiently large sample spaces 
not all subsets will be events. Thus the class of all subsets of the sample space
will not necessarily correspond to the event space. However, we shall see that 
the class of all events can always be selected to be large enough so as to include 
all those subsets (events) whose probability we may want to talk about. If the 
sample space consists of only a finite number of points, then the corresponding 
event space will be the class of all subsets of the sample space. 

Our primary interest will not be in events per se but will be in the
probability that an event does or does not occur or happen. An event A is said to
occur if the experiment at hand results in an outcome (a point in our sample
space) that belongs to A. Since a point, say $\omega$, in the sample space is a subset
(that subset consisting of the point $\omega$) of the sample space $\Omega$, it is a candidate to
be an event. Thus $\omega$ can be viewed as a point in $\Omega$ or as a subset of $\Omega$. To
distinguish, let us write $\{\omega\}$, rather than just $\omega$, whenever $\omega$ is to be viewed as a
subset of $\Omega$. Such a one-point subset will always be an event and will be called
an elementary event. Also $\phi$ and $\Omega$ are both subsets of $\Omega$, and both will always
be events. $\Omega$ is sometimes called the sure event.

We shall attempt to use only capital Latin letters (usually from the
beginning of the alphabet), with or without affixes, to denote events, with the
exception that $\phi$ will be used to denote the empty set and $\Omega$ the sure event. The event
space will always be denoted by a script Latin letter, and usually by $\mathcal{A}$. The
symbols $\mathcal{B}$ and $\mathcal{F}$, as well as other symbols, are used in some texts to denote
the class of all events.

The sample space is basic and generally easy to define for a given
experiment. Yet, as we shall see, it is the event space that is really essential in
defining probability. Some examples follow.

EXAMPLE 6 The experiment is the tossing of a single die (a regular six-sided
polyhedron or cube marked on each face with one to six spots) and noting
which face is up. Now the die can land with any one of the six faces up;
so there are six possible outcomes of the experiment:

    $\Omega$ = {⚀, ⚁, ⚂, ⚃, ⚄, ⚅}.

Let A = {even number of spots up}. A is an event; it is a subset of $\Omega$:
A = {⚁, ⚃, ⚅}. Let $A_i$ = {i spots up}; i = 1, 2, ..., 6. Each $A_i$ is an
elementary event. For this experiment the sample space is finite; hence
the event space is all subsets of $\Omega$. There are $2^6 = 64$ events, of which only
6 are elementary, in $\mathcal{A}$ (including both $\phi$ and $\Omega$). See Example 19 of
Subsec. 3.5, where a technique for counting the number of events in a finite
sample space is presented. ////


EXAMPLE 7 Toss a penny, nickel, and dime simultaneously, and note which
side is up on each. There are eight possible outcomes of this experiment:
$\Omega$ = {(H, H, H), (H, H, T), (H, T, H), (T, H, H), (H, T, T), (T, H, T),
(T, T, H), (T, T, T)}. We are using the first position of (·, ·, ·), called a
3-tuple, to record the outcome of the penny, the second position to record
the outcome of the nickel, and the third position to record the outcome of
the dime. Let $A_i$ = {exactly i heads}; i = 0, 1, 2, 3. For each i, $A_i$ is an
event. Note that $A_0$ and $A_3$ are each elementary events. Again all
subsets of $\Omega$ are events; there are $2^8 = 256$ of them. ////
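
The sample space and events of the three-coin experiment can be enumerated
directly. This is a minimal sketch; the variable names are our own, and the
letters "H" and "T" simply stand for the two sides of each coin.

```python
from itertools import product

# Sample space for tossing a penny, nickel, and dime: all 3-tuples of H/T,
# with positions (penny, nickel, dime).
omega = list(product("HT", repeat=3))

# A_i = {exactly i heads}, i = 0, 1, 2, 3.
events = {i: {w for w in omega if w.count("H") == i} for i in range(4)}

sizes = {i: len(a) for i, a in events.items()}
# For a finite sample space every subset is an event, so there are 2^N events.
n_events = 2 ** len(omega)

print(len(omega), sizes, n_events)
```

The events of size one in this enumeration are exactly the ones that are
elementary events.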


EXAMPLE 8 The experiment is to record the number of traffic deaths in the
state of Colorado next year. Any nonnegative integer is a conceivable
outcome of this experiment; so $\Omega = \{0, 1, 2, \ldots\}$. A = {fewer than 500
deaths} = {0, 1, ..., 499} is an event. $A_i$ = {exactly i deaths}, i = 0, 1, ...,
is an elementary event. There is an infinite number of points in the
sample space, and each point is itself an (elementary) event; so there is an
infinite number of events. Each subset of $\Omega$ is an event. ////


EXAMPLE 9 Select a light bulb, and record the time in hours that it burns
before burning out. Any nonnegative number is a conceivable outcome
of this experiment; so $\Omega = \{x: x \ge 0\}$. For this sample space not all
subsets of $\Omega$ are events; however, any subset that can be exhibited will be
an event. For example, let

    A = {bulb burns for at least k hours but burns out before m hours}
      = $\{x: k \le x < m\}$;

then A is an event for any $0 \le k < m$. ////


EXAMPLE 10 Consider a random experiment which consists of counting 
the number of times that it rains and recording in inches the total rainfall 
next July in Fort Collins, Colorado. The sample space could then be 
represented by 


    $\Omega = \{(i, x): i = 0, 1, 2, \ldots \text{ and } 0 \le x\}$,

where in the 2-tuple (·, ·) the first position indicates the number of times
that it rains and the second position indicates the total rainfall. For
example, $\omega = (7, 2.251)$ is a point in $\Omega$ corresponding to there being seven
different times that it rained with a total rainfall of 2.251 inches.
$A = \{(i, x): i = 5, \ldots, 10 \text{ and } x \ge 3\}$ is an example of an event. ////


EXAMPLE 11 In an agricultural experiment, the yield of five varieties of 
wheat is examined. The five varieties are all grown under rather uniform 
conditions. The outcome is a collection of five numbers $(y_1, y_2, y_3, y_4, y_5)$,
where $y_i$ represents the yield of the ith variety in bushels per acre. Each
$y_i$ can conceivably be any real number greater than or equal to 0. In this
example let the event A be defined by the conditions that $y_2$, $y_3$, $y_4$, and
$y_5$ are each 10 or more bushels per acre larger than $y_1$, the standard
variety. In our notation we write

    $A = \{(y_1, y_2, y_3, y_4, y_5): y_j \ge y_1 + 10;\ j = 2, 3, 4, 5\}$. ////

Our definition of sample space is precise and satisfactory, whereas our
definitions of event and event space are not entirely satisfactory. We said that
if the sample space was “sufficiently large” (as in Examples 9 to 11 above), not all
subsets of the sample space would be events; however, we did not say exactly
which subsets would be events and which would not. Rather than developing
the necessary mathematics to precisely define which subsets of $\Omega$ constitute our
event space $\mathcal{A}$, let us state some properties of $\mathcal{A}$ that it seems reasonable to
require.

(i) $\Omega \in \mathcal{A}$.

(ii) If $A \in \mathcal{A}$, then $\bar{A} \in \mathcal{A}$.

(iii) If $A_1$ and $A_2 \in \mathcal{A}$, then $A_1 \cup A_2 \in \mathcal{A}$.

We said earlier that we were interested in events mainly because we would
be interested in the probability that an event happens. Surely, then, we would
want $\mathcal{A}$ to include $\Omega$, the sure event. Also, if A is an event, meaning we can
talk about the probability that A occurs, then $\bar{A}$ should also be an event so that
we can talk about the probability that A does not occur. Similarly, if $A_1$ and
$A_2$ are events, so should $A_1 \cup A_2$ be an event.

Any collection of events with properties (i) to (iii) is called a Boolean
algebra, or just algebra, of events. We might note that the collection of all
subsets of $\Omega$ necessarily satisfies the above properties. Several results follow
from the above assumed properties of $\mathcal{A}$.

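Theorem 12 $\phi \in \mathcal{A}$.

PROOF By property (i) $\Omega \in \mathcal{A}$; by (ii) $\bar{\Omega} \in \mathcal{A}$; but $\bar{\Omega} = \phi$; so $\phi \in \mathcal{A}$. ////

Theorem 13 If $A_1$ and $A_2 \in \mathcal{A}$, then $A_1 \cap A_2 \in \mathcal{A}$.

PROOF $\bar{A}_1$ and $\bar{A}_2 \in \mathcal{A}$; hence $\bar{A}_1 \cup \bar{A}_2$, and $\overline{(\bar{A}_1 \cup \bar{A}_2)} \in \mathcal{A}$, but
$\overline{(\bar{A}_1 \cup \bar{A}_2)} = \bar{\bar{A}}_1 \cap \bar{\bar{A}}_2 = A_1 \cap A_2$ by De Morgan’s law. ////

Theorem 14 If $A_1, A_2, \ldots, A_n \in \mathcal{A}$, then $\bigcup_{i=1}^{n} A_i$ and $\bigcap_{i=1}^{n} A_i \in \mathcal{A}$.

PROOF Follows by induction. ////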
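
Properties (i) to (iii) and Theorems 12 to 14 can be observed on a small
finite example. The sketch below builds the smallest collection that contains
a few chosen events and is closed under complement and pairwise union; the
function name `generate_algebra` and the two starting events are hypothetical
choices for illustration only.

```python
from itertools import combinations

def generate_algebra(space, events):
    """Smallest collection containing `events` and the whole space that is
    closed under complement and pairwise union (properties (i) to (iii))."""
    space = frozenset(space)
    algebra = {space} | {frozenset(e) for e in events}
    while True:
        new = {space - a for a in algebra}                     # complements
        new |= {a | b for a, b in combinations(algebra, 2)}    # pairwise unions
        if new <= algebra:                                     # nothing new: closed
            return algebra
        algebra |= new

# Die-tossing space with two events of interest (illustrative choice).
alg = generate_algebra(range(1, 7), [{2, 4, 6}, {1, 2}])

print(len(alg))
print(frozenset() in alg)                          # Theorem 12
print(frozenset({2, 4, 6}) & frozenset({1, 2}) in alg)   # Theorem 13
```

Although the closure step never forms an intersection explicitly, the
intersection {2} still appears in the result, as Theorem 13 predicts.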

We will always assume that our collection of events $\mathcal{A}$ is an algebra,
which partially justifies our use of $\mathcal{A}$ as our notation for it. In practice, one
might take that collection of events of interest in a given consideration and
enlarge the collection, if necessary, to include (i) the sure event, (ii) all
complements of events already included, and (iii) all finite unions and intersections of
events already included, and this will be an algebra $\mathcal{A}$. Thus far, we have not
explained why $\mathcal{A}$ cannot always be taken to be the collection of all subsets of $\Omega$.
Such explanation will be given when we define probability in the next subsection.

3.4 Definition of Probability

In this section we give the axiomatic definition of probability. Although this 
formal definition of probability will not in itself allow us to achieve our goal of 
assigning actual probabilities to events consisting of certain outcomes of random 
experiments, it is another in a series of definitions that will ultimately lead to 
that goal. Since probability, as well as forthcoming concepts, is defined as a 
particular function, we begin this subsection with a review of the notion of a 
function. 

The definition of a function The following terminology is frequently used
to describe a function: A function, say f(·), is a rule (law, formula, recipe) that
associates each point in one set of points with one and only one point in another
set of points. The first collection of points, say A, is called the domain, and the
second collection, say B, the counterdomain.

Definition 13 Function A function, say f(·), with domain A and
counterdomain B, is a collection of ordered pairs, say (a, b), satisfying (i) $a \in A$
and $b \in B$; (ii) each $a \in A$ occurs as the first element of some ordered pair
in the collection (each $b \in B$ is not necessarily the second element of some
ordered pair); and (iii) no two (distinct) ordered pairs in the collection
have the same first element. ////

If $(a, b) \in f(\cdot)$, we write b = f(a) (read “b equals f of a”) and call f(a)
the value of f(·) at a. For any $a \in A$, f(a) is an element of B; whereas f(·) is
a set of ordered pairs. The set of all values of f(·) is called the range of f(·);
i.e., the range of f(·) = $\{b \in B: b = f(a) \text{ for some } a \in A\}$ and is always a subset
of the counterdomain B but is not necessarily equal to it. f(a) is also called the
image of a under f(·), and a is called the preimage of f(a).
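
The three conditions of Definition 13 translate directly into a check on a
finite set of ordered pairs. The following sketch is only an illustration: the
helper `is_function` and the small sets `A`, `B`, `f`, and `g` are our own
hypothetical examples.

```python
def is_function(pairs, domain, counterdomain):
    """Check conditions (i) to (iii) of Definition 13 for a finite set of pairs."""
    pairs = set(pairs)
    firsts = [a for a, _ in pairs]
    in_sets = all(a in domain and b in counterdomain for a, b in pairs)   # (i)
    covers_domain = set(firsts) == set(domain)                            # (ii)
    single_valued = len(firsts) == len(set(firsts))                       # (iii)
    return in_sets and covers_domain and single_valued

A, B = {1, 2, 3}, {"u", "v", "w"}
f = {(1, "u"), (2, "u"), (3, "v")}             # a function from A to B
g = {(1, "u"), (1, "v"), (2, "w"), (3, "w")}   # 1 occurs twice as a first element

range_f = {b for _, b in f}   # the set of all values of f

print(is_function(f, A, B), is_function(g, A, B), range_f)
```

Here the range of `f` is a proper subset of the counterdomain `B`, which
condition (ii) explicitly allows.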


EXAMPLE 12 Let $f_1(\cdot)$ and $f_2(\cdot)$ be the two functions, having the real line
for their domain and counterdomain, defined by

    $f_1(\cdot) = \{(x, y): y = x^3 + x + 1,\ -\infty < x < \infty\}$

and

    $f_2(\cdot) = \{(x, y): y = x^2,\ -\infty < x < \infty\}$.

The range of $f_1(\cdot)$ is the counterdomain, the whole real line, but the range
of $f_2(\cdot)$ is all nonnegative real numbers, not the same as the
counterdomain. ////

Of particular interest to us will be a class of functions that are called
indicator functions.


Definition 14 Indicator function Let $\Omega$ be any space with points $\omega$
and A any subset of $\Omega$. The indicator function of A, denoted by $I_A(\cdot)$,
is the function with domain $\Omega$ and counterdomain equal to the set
consisting of the two real numbers 0 and 1 defined by

    $I_A(\omega) = 1$ if $\omega \in A$,
    $I_A(\omega) = 0$ if $\omega \notin A$.

$I_A(\cdot)$ clearly “indicates” the set A. ////


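Properties of Indicator Functions Let $\Omega$ be any space and $\mathcal{A}$ any collection
of subsets of $\Omega$:

(i) $I_A(\omega) = 1 - I_{\bar{A}}(\omega)$ for every $A \in \mathcal{A}$.

(ii) $I_{A_1A_2 \cdots A_n}(\omega) = I_{A_1}(\omega) \cdot I_{A_2}(\omega) \cdots I_{A_n}(\omega)$ for $A_1, \ldots, A_n \in \mathcal{A}$.

(iii) $I_{A_1 \cup A_2 \cup \cdots \cup A_n}(\omega) = \max[I_{A_1}(\omega), I_{A_2}(\omega), \ldots, I_{A_n}(\omega)]$ for $A_1, \ldots, A_n \in \mathcal{A}$.

(iv) $I_A^2(\omega) = I_A(\omega)$ for every $A \in \mathcal{A}$.

Proofs of the above properties are left as an exercise.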
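
Before attempting the proofs, the reader may find it helpful to confirm the
four properties (complement, product, maximum, and square) by brute force on
a small space. The sketch below does so for every pair of subsets of a
six-point space; the space and the helper names are assumptions of ours, and
properties (ii) and (iii) are checked only for n = 2.

```python
from itertools import combinations

omega = range(6)
# All 2^6 subsets of the space.
subsets = [frozenset(c) for r in range(len(omega) + 1)
           for c in combinations(omega, r)]

def indicator(a):
    """Return the indicator function of the set a as a function of the point w."""
    return lambda w: 1 if w in a else 0

def properties_hold(a, b):
    ia, ib = indicator(a), indicator(b)
    i_comp = indicator(frozenset(omega) - a)
    i_and, i_or = indicator(a & b), indicator(a | b)
    return all(
        ia(w) == 1 - i_comp(w)              # (i)   complement
        and i_and(w) == ia(w) * ib(w)       # (ii)  intersection, n = 2
        and i_or(w) == max(ia(w), ib(w))    # (iii) union, n = 2
        and ia(w) ** 2 == ia(w)             # (iv)  square
        for w in omega
    )

all_ok = all(properties_hold(a, b) for a in subsets for b in subsets)
print(len(subsets), all_ok)
```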

The indicator function will be used to “indicate” subsets of the real line;
e.g.,

    $I_{\{[0, 1)\}}(x) = I_{[0, 1)}(x) = 1$ if $0 \le x < 1$,
                                  $= 0$ otherwise,

and if $I^+$ denotes the set of positive integers,

    $I_{I^+}(x) = 1$ if x is some positive integer,
              $= 0$ otherwise.

Frequent use of indicator functions will be made throughout the remainder
of this book. Often the utility of the indicator function is just notational
efficiency, as the following example shows.




EXAMPLE 13 Let the function f(·) be defined by

    $f(x) = 0$       for $x \le 0$,
    $f(x) = x$       for $0 < x \le 1$,
    $f(x) = 2 - x$   for $1 < x \le 2$,
    $f(x) = 0$       for $2 < x$.

By using the indicator function, f(x) can be written as

    $f(x) = x I_{(0, 1]}(x) + (2 - x) I_{(1, 2]}(x)$,

or also by using the absolute value symbol as

    $f(x) = (1 - |1 - x|) I_{(0, 2]}(x)$. ////
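
The equivalence of the three ways of writing this function can be checked
numerically. The sketch assumes the function equals 0 for x ≤ 0, x on (0, 1],
2 − x on (1, 2], and 0 for x > 2, and compares the piecewise form, the
indicator form, and the absolute-value form on a grid; the helper names are
our own.

```python
def ind(lo, hi):
    """Indicator function of the half-open interval (lo, hi]."""
    return lambda x: 1 if lo < x <= hi else 0

def f_piecewise(x):
    if x <= 0:
        return 0
    if x <= 1:
        return x
    if x <= 2:
        return 2 - x
    return 0

def f_indicator(x):
    # x * I_(0,1](x) + (2 - x) * I_(1,2](x)
    return x * ind(0, 1)(x) + (2 - x) * ind(1, 2)(x)

def f_abs(x):
    # (1 - |1 - x|) * I_(0,2](x)
    return (1 - abs(1 - x)) * ind(0, 2)(x)

grid = [k / 4 for k in range(-8, 17)]   # -2.0, -1.75, ..., 4.0
agree = all(f_piecewise(x) == f_indicator(x) == f_abs(x) for x in grid)
print(agree)
```

The grid uses multiples of 1/4 so that every value is represented exactly in
floating point and the comparison with `==` is safe.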


Another type of function that we will have occasion to discuss is the set
function, defined as any function which has as its domain a collection of sets
and as its counterdomain the real line including, possibly, infinity. Examples
of set functions follow.

EXAMPLE 14 Let $\Omega$ be the sample space corresponding to the experiment of
tossing two dice, and let $\mathcal{A}$ be the collection of all subsets of $\Omega$. For
any $A \in \mathcal{A}$ define N(A) = number of outcomes, or points in $\Omega$, that are
in A. Then $N(\phi) = 0$, $N(\Omega) = 36$, and N(A) = 6 if A is the event
containing those outcomes having a total of seven spots up. ////

The size-of-set function alluded to in the above example can be defined, in
general, for any set A as the number of points in A, where A is a member of an
arbitrary collection of sets $\mathcal{A}$.

EXAMPLE 15 Let $\Omega$ be the plane or two-dimensional euclidean space and
$\mathcal{A}$ any collection of subsets of $\Omega$ for which area is meaningful. Then
for any $A \in \mathcal{A}$ define Q(A) = area of A. For example, if
$A = \{(x, y): 0 \le x \le 1,\ 0 \le y \le 1\}$, then Q(A) = 1; if $A = \{(x, y): x^2 + y^2 \le r^2\}$, then
$Q(A) = \pi r^2$; and if A = {(0, 0), (1, 1)}, then Q(A) = 0. ////


The probability function to be defined will be a particular set function. 


Probability function Let $\Omega$ denote the sample space and $\mathcal{A}$ denote a
collection of events assumed to be an algebra of events (see Subsec. 3.3) that we shall
consider for some random experiment.

Definition 15 Probability function A probability function P[·] is a set
function with domain $\mathcal{A}$ (an algebra of events)* and counterdomain the
interval [0, 1] which satisfies the following axioms:

(i) $P[A] \ge 0$ for every $A \in \mathcal{A}$.

(ii) $P[\Omega] = 1$.

(iii) If $A_1, A_2, \ldots$ is a sequence of mutually exclusive events in $\mathcal{A}$
(that is, $A_i \cap A_j = \phi$ for $i \ne j$; i, j = 1, 2, ...) and if
$A_1 \cup A_2 \cup \cdots = \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$, then $P[\bigcup_{i=1}^{\infty} A_i] = \sum_{i=1}^{\infty} P[A_i]$. ////

These axioms are certainly motivated by the definitions of classical and
frequency probability. This definition of probability is a mathematical
definition; it tells us which set functions can be called probability functions; it does not
tell us what value the probability function P[·] assigns to a given event A. We
will have to model our random experiment in some way in order to obtain values
for the probability of events.

P[A] is read “the probability of event A” or “the probability that event A
occurs,” which means the probability that any outcome in A occurs.

We have used brackets rather than parentheses in our notation for a
probability function, and we shall continue to do the same throughout the
remainder of this book.


*In defining a probability function, many authors assume that the domain of the set
function is a sigma-algebra rather than just an algebra. For an algebra $\mathcal{A}$, we had the
property

    if $A_1$ and $A_2 \in \mathcal{A}$, then $A_1 \cup A_2 \in \mathcal{A}$.

A sigma-algebra differs from an algebra in that the above property is replaced by

    if $A_1, A_2, \ldots, A_n, \ldots \in \mathcal{A}$, then $\bigcup_{n=1}^{\infty} A_n \in \mathcal{A}$.

It can be shown that a sigma-algebra is an algebra, but not necessarily conversely.
If the domain of the probability function is taken to be a sigma-algebra, then axiom
(iii) can be simplified to

    if $A_1, A_2, \ldots$ is a sequence of mutually exclusive events in $\mathcal{A}$, then
    $P[\bigcup_{i=1}^{\infty} A_i] = \sum_{i=1}^{\infty} P[A_i]$.

A fundamental theorem of probability theory, called the extension theorem, states that
if a probability function is defined on an algebra (as we have done) then it can be
extended to a sigma-algebra. Since the probability function can be extended from an
algebra to a sigma-algebra, it is reasonable to begin by assuming that the probability
function is defined on a sigma-algebra.

EXAMPLE 16 Consider the experiment of tossing two coins, say a penny and
a nickel. Let $\Omega$ = {(H, H), (H, T), (T, H), (T, T)}, where the first
component of (·, ·) represents the outcome for the penny. Let us model this
random experiment by assuming that the four points in $\Omega$ are equally
likely; that is, assume P[{(H, H)}] = P[{(H, T)}] = P[{(T, H)}] =
P[{(T, T)}]. The following question arises: Is the P[·] function that is
implicitly defined by the above really a probability function; that is, does
it satisfy the three axioms? It can be shown that it does, and so it is
a probability function. ////
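
For the two-coin model the three axioms can be checked exhaustively, since
the event space has only 16 members. The sketch below assumes the equally
likely assignment P[A] = N(A)/4 and uses exact fractions; the names are our
own. On a finite space any sequence of mutually exclusive events has only
finitely many nonempty members, so axiom (iii) reduces to additivity over
disjoint pairs (the general finite case then follows by induction).

```python
from fractions import Fraction
from itertools import combinations

omega = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]
# Event space: all 2^4 subsets of the sample space.
events = [frozenset(c) for r in range(len(omega) + 1)
          for c in combinations(omega, r)]

def prob(a):
    """Equally likely model: P[A] = N(A) / N(Omega)."""
    return Fraction(len(a), len(omega))

axiom_i = all(prob(a) >= 0 for a in events)
axiom_ii = prob(frozenset(omega)) == 1
# Additivity for every pair of mutually exclusive events.
axiom_iii = all(prob(a | b) == prob(a) + prob(b)
                for a in events for b in events if not (a & b))

print(axiom_i, axiom_ii, axiom_iii)
```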


In our definitions of event and $\mathcal{A}$, a collection of events, we stated that $\mathcal{A}$
cannot always be taken to be the collection of all subsets of $\Omega$. The reason for
this is that for “sufficiently large” $\Omega$ the collection of all subsets of $\Omega$ is so large
that it is impossible to define a probability function consistent with the above
axioms.

We are able to deduce a number of properties of our function P[·] from
its definition and three axioms. We list these as theorems.

It is in the statements and proofs of these properties that we will see the
convenience provided by assuming $\mathcal{A}$ is an algebra of events. $\mathcal{A}$ is the domain
of P[·]; hence only members of $\mathcal{A}$ can be placed in the dot position of the
notation P[·]. Since $\mathcal{A}$ is an algebra, if we assume that A and $B \in \mathcal{A}$, we know
that $\bar{A}$, $A \cup B$, $AB$, $\bar{A}\bar{B}$, etc., are also members of $\mathcal{A}$, and so it makes sense to
talk about $P[\bar{A}]$, $P[A \cup B]$, $P[AB]$, $P[\bar{A}\bar{B}]$, etc.

Properties of P[·] For each of the following theorems, assume that $\Omega$ and
$\mathcal{A}$ (an algebra of events) are given and P[·] is a probability function having
domain $\mathcal{A}$.


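Theorem 15 $P[\phi] = 0$.

PROOF Take $A_1 = \phi$, $A_2 = \phi$, $A_3 = \phi$, ...; then by axiom (iii)

    $P[\phi] = P[\bigcup_{i=1}^{\infty} A_i] = \sum_{i=1}^{\infty} P[A_i] = \sum_{i=1}^{\infty} P[\phi]$,

which can hold only if $P[\phi] = 0$. ////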


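Theorem 16 If $A_1, \ldots, A_n$ are mutually exclusive events in $\mathcal{A}$, then

    $P[A_1 \cup \cdots \cup A_n] = \sum_{i=1}^{n} P[A_i]$.

PROOF Let $A_{n+1} = \phi$, $A_{n+2} = \phi$, ...; then $\bigcup_{i=1}^{n} A_i = \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$,
and

    $P[\bigcup_{i=1}^{n} A_i] = P[\bigcup_{i=1}^{\infty} A_i] = \sum_{i=1}^{\infty} P[A_i] = \sum_{i=1}^{n} P[A_i]$. ////

Theorem 17 If A is an event in $\mathcal{A}$, then

    $P[\bar{A}] = 1 - P[A]$.

PROOF $A \cup \bar{A} = \Omega$, and $A \cap \bar{A} = \phi$; so

    $P[\Omega] = P[A \cup \bar{A}] = P[A] + P[\bar{A}]$.

But $P[\Omega] = 1$ by axiom (ii); the result follows. ////

Theorem 18 If A and $B \in \mathcal{A}$, then $P[A] = P[AB] + P[A\bar{B}]$, and $P[A - B]
= P[A\bar{B}] = P[A] - P[AB]$.

PROOF $A = AB \cup A\bar{B}$, and $AB \cap A\bar{B} = \phi$; so $P[A] = P[AB] +
P[A\bar{B}]$. ////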

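Theorem 19 For every two events A and $B \in \mathcal{A}$, $P[A \cup B] = P[A]
+ P[B] - P[AB]$. More generally, for events $A_1, A_2, \ldots, A_n \in \mathcal{A}$,

    $P[A_1 \cup A_2 \cup \cdots \cup A_n] = \sum_{j=1}^{n} P[A_j] - \sum\sum_{i<j} P[A_iA_j]
        + \sum\sum\sum_{i<j<k} P[A_iA_jA_k] - \cdots + (-1)^{n+1} P[A_1A_2 \cdots A_n]$.

PROOF $A \cup B = A \cup \bar{A}B$, and $A \cap \bar{A}B = \phi$; so

    $P[A \cup B] = P[A] + P[\bar{A}B]
               = P[A] + P[B] - P[AB]$.

The more general statement is proved by mathematical induction. (See
Problem 16.) ////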

Theorem 20 If A and $B \in \mathcal{A}$ and $A \subset B$, then $P[A] \le P[B]$.

PROOF $B = BA \cup B\bar{A}$, and $BA = A$; so $B = A \cup B\bar{A}$, and $A \cap B\bar{A} =
\phi$; hence $P[B] = P[A] + P[B\bar{A}]$. The conclusion follows by noting that
$P[B\bar{A}] \ge 0$. ////

Theorem 21 Boole’s inequality If $A_1, A_2, \ldots, A_n \in \mathcal{A}$, then

    $P[A_1 \cup A_2 \cup \cdots \cup A_n] \le P[A_1] + P[A_2] + \cdots + P[A_n]$.

PROOF $P[A_1 \cup A_2] = P[A_1] + P[A_2] - P[A_1A_2] \le P[A_1] + P[A_2]$.
The proof is completed using mathematical induction. ////
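
Theorem 19 and Boole’s inequality can be checked numerically on the two-dice
sample space with equally likely points. The three events below are
illustrative choices of ours (first die shows 6, second die shows 6, total at
least 10); the sketch computes the probability of the union directly and by
the alternating sum over all intersections.

```python
from fractions import Fraction
from itertools import combinations, product

omega = list(product(range(1, 7), repeat=2))   # two dice, 36 equally likely points

def prob(a):
    return Fraction(len(a), len(omega))

# Illustrative events (assumptions for this sketch).
events = [
    {w for w in omega if w[0] == 6},       # first die shows 6
    {w for w in omega if w[1] == 6},       # second die shows 6
    {w for w in omega if sum(w) >= 10},    # total is at least 10
]

lhs = prob(set().union(*events))           # P[A1 u A2 u A3], computed directly

rhs = Fraction(0)                          # alternating sum of Theorem 19
for k in range(1, len(events) + 1):
    for group in combinations(events, k):
        rhs += (-1) ** (k + 1) * prob(set.intersection(*group))

boole_bound = sum(prob(a) for a in events)  # right side of Boole's inequality

print(lhs, rhs, boole_bound)
```

The two computations of the union probability agree, and both fall below the
simple sum of the three probabilities, which overcounts the overlaps.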

We conclude this subsection with one final definition. 

Definition 16 Probability space A probability space is the triplet
$(\Omega, \mathcal{A}, P[\cdot])$, where $\Omega$ is a sample space, $\mathcal{A}$ is a collection (assumed to be
an algebra) of events (each a subset of $\Omega$), and P[·] is a probability
function with domain $\mathcal{A}$. ////

Probability space is a single term that gives us an expedient way to assume
the existence of all three components in its notation. The three components are
related; $\mathcal{A}$ is a collection of subsets of $\Omega$, and P[·] is a function that has $\mathcal{A}$ as its
domain. The probability space’s main use is in providing a convenient method
of stating background assumptions for future definitions, theorems, etc. It also
ties together the main definitions that we have covered so far, namely, definitions
of sample space, event space, and probability.


3.5 Finite Sample Spaces 

In previous subsections we formally defined sample space, event, and probability, 
culminating in the definition of probability space. We remarked there that these 
formal definitions did not in themselves enable us to compute the value of the 
probability for an event A , which is our goal. We said that we had to appro- 
priately model the experiment. In this section we show how this can be done 
for finite sample spaces, that is, sample spaces with only a finite number of 
elements or points in them. 

In certain kinds of problems, of which games of chance are notable
examples, the sample space contains a finite number of points, say $N = N(\Omega)$.
[Recall that N(A) is the size of A, that is, the number of sample points in A.]
Some of these problems can be modeled by assuming that points in the sample 
space are equally likely. Such problems are the subject to be discussed next. 

Finite sample space with equally likely points For certain random
experiments there is a finite number of outcomes, say N, and it is often realistic
to assume that the probability of each outcome is 1/N. The classical definition
of probability is generally adequate for these problems, but we shall show how
the axiomatic definition is applicable as well. Let $\omega_1, \omega_2, \ldots, \omega_N$ be the N
sample points in a finite space $\Omega$. Suppose that the set function P[·] with
domain the collection of all subsets of $\Omega$ satisfies the following conditions:

(i) $P[\{\omega_1\}] = P[\{\omega_2\}] = \cdots = P[\{\omega_N\}]$.

(ii) If A is any subset of $\Omega$ which contains N(A) sample points [has size
N(A)], then P[A] = N(A)/N.

Then it is readily checked that the set function P[·] satisfies the three axioms
and hence is a probability function.

Definition 17 Equally likely probability function The probability
function P[·] satisfying conditions (i) and (ii) above is defined to be an equally
likely probability function. ////

Given that a random experiment can be realistically modeled by assuming
equally likely sample points, the only problem left in determining the value of
the probability of event A is to find $N(\Omega) = N$ and N(A). Strictly speaking this
is just a problem of counting: count the number of points in A and the number
of points in $\Omega$.


EXAMPLE 17 Consider the experiment of tossing two dice (or of tossing one
die twice). Let $\Omega = \{(i_1, i_2): i_1 = 1, 2, \ldots, 6;\ i_2 = 1, 2, \ldots, 6\}$. Here $i_1$ =
number of spots up on the first die, and $i_2$ = number of spots up on the
second die. There are 6 · 6 = 36 sample points. It seems reasonable to attach
the probability of 1/36 to each sample point. $\Omega$ can be displayed as a lattice
as in Fig. 2. Let $A_7$ = event that the total is 7; then $A_7$ = {(1, 6), (2, 5),
(3, 4), (4, 3), (5, 2), (6, 1)}; so $N(A_7) = 6$, and $P[A_7] = N(A_7)/N(\Omega) = 6/36 =
1/6$. Similarly $P[A_j]$ can be calculated for $A_j$ = total of j; j = 2, ..., 12. In
this example the number of points in any event A can be easily counted,
and so P[A] can be evaluated for any event A. ////
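
The counting in the two-dice example is easily mechanized. A minimal sketch,
assuming the equally likely model on the 36 ordered pairs and using exact
fractions (the function name `prob_total` is ours):

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # 36 equally likely points (i1, i2)

def prob_total(j):
    """P[A_j] = N(A_j) / N(Omega), where A_j = {total of j}."""
    a_j = [w for w in omega if sum(w) == j]
    return Fraction(len(a_j), len(omega))

table = {j: prob_total(j) for j in range(2, 13)}

for j, p in table.items():
    print(j, p)
```

Because the events "total of j", j = 2, ..., 12, are mutually exclusive and
together make up the whole sample space, the tabulated probabilities sum to 1.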

If N(A) and $N(\Omega)$ are large for a given random experiment with a finite
number of equally likely outcomes, the counting itself can become a difficult
problem. Such counting can often be facilitated by use of certain combinatorial
formulas, some of which will be developed now.

Assume now that the experiment is of such a nature that each outcome
can be represented by an n-tuple. The above example is such an experiment;
each outcome was represented by a 2-tuple. As another example, if the
experiment is one of drawing a sample of size n, then n-tuples are particularly
useful in recording the results. The terminology that is often used to describe
a basic random experiment known generally as sampling is that of balls and urns.
It is assumed that we have an urn containing, say, M balls, which are numbered 
1 to M. The experiment is to select or draw balls from the urn one at a time 
until n balls have been drawn. We say we have drawn a sample of size n. The 
drawing is done in such a way that at the time of a particular draw each of the 
balls in the urn at that time has an equal chance of selection. We say that a 
ball has been selected at random. Two basic ways of drawing a sample are 
with replacement and without replacement , meaning just what the words say. A 
sample is said to be drawn with replacement, if after each draw the ball drawn 
is itself returned to the urn, and the sample is said to be drawn without replace- 
ment if the ball drawn is not returned to the urn. Of course, in sampling without 
replacement the size of the sample n must be less than or equal to M, the original 
number of balls in the urn, whereas in sampling with replacement the size of 
sample may be any positive integer. In reporting the results of drawing a sample
of size n, an n-tuple can be used; denote the n-tuple by $(z_1, \ldots, z_n)$, where $z_i$
represents the number of the ball drawn on the ith draw.

In general, we are interested in the size of an event that is composed of
points that are n-tuples satisfying certain conditions. The size of such a set can be
computed as follows: First determine the number of objects, say $N_1$, that may be
used as the first component. Next determine the number of objects, say $N_2$,
that may be used as the second component of an n-tuple given that the first
component is known. (We are assuming that $N_2$ does not depend on which
object has occurred as the first component.) And then determine the number of
objects, say $N_3$, that may be used as the third component given that the first
and second components are known. (Again we are assuming $N_3$ does not
depend on which objects have occurred as the first and second components.)
Continue in this manner until $N_n$ is determined. The size N(A) of the set A of
n-tuples then equals $N_1 \cdot N_2 \cdots N_n$.


EXAMPLE 18 The total number of different ordered samples of $n$ balls that
can be obtained by drawing balls from an urn containing $M$ distinguishable
balls (distinguished by numbers 1 to $M$) is $M^n$ if the sampling is done
with replacement and is $M(M-1)\cdots(M-n+1)$ if the sampling is
done without replacement. An ordered sample can be represented by
an $n$-tuple, say $(z_1, \ldots, z_n)$, where $z_j$ is the number of the ball obtained on the
$j$th draw, and the total number of different ordered samples is the same as
the total number of $n$-tuples. In sampling with replacement, there are $M$
choices of numbers for the first component, $M$ choices of numbers for the
second component, and finally $M$ choices for the $n$th component. Thus
there are $M^n$ such $n$-tuples. In sampling without replacement, there are
$M$ choices of numbers for the first component, $M-1$ choices for the
second, $M-2$ choices for the third, and finally $M-n+1$ choices for
the $n$th component. In total, then, there are $M(M-1)(M-2)\cdots(M-n+1)$
such $n$-tuples. $M(M-1)\cdots(M-n+1)$ is abbreviated $(M)_n$ (see Appendix A). ////
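The two counts in Example 18 can be checked by brute force for small cases. The following is a minimal sketch, not part of the original text; the function name `count_ordered_samples` and the values of `M` and `n` are illustrative assumptions.

```python
from itertools import permutations, product
from math import perm

def count_ordered_samples(M, n, replacement):
    """Count ordered samples of size n from balls 1..M by direct enumeration."""
    balls = range(1, M + 1)
    if replacement:
        # every n-tuple is allowed: M choices for each component
        return sum(1 for _ in product(balls, repeat=n))
    # no ball may repeat: M, M-1, ..., M-n+1 choices
    return sum(1 for _ in permutations(balls, n))

M, n = 5, 3  # illustrative values, not from the text
print(count_ordered_samples(M, n, replacement=True), M ** n)
print(count_ordered_samples(M, n, replacement=False), perm(M, n))
```

Here `math.perm(M, n)` computes $(M)_n = M(M-1)\cdots(M-n+1)$, so each printed pair should agree.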


EXAMPLE 19 Let $S$ be any set containing $M$ elements. How many subsets
does $S$ have? First let us determine the number of subsets of size $n$ that
$S$ has. Let $x_n$ denote this number, that is, the number of subsets of $S$ of
size $n$. A subset of size $n$ is a collection of $n$ objects, the objects not
arranged in any particular order. For example, the subset $\{z_1, z_5, z_7\}$ is
the same as the subset $\{z_5, z_1, z_7\}$ since they contain the same three objects.
If we take a given subset of $S$ which contains $n$ elements, $n!$ different
ordered samples can be obtained by sampling from the given subset
without replacement. If for each of the $x_n$ different subsets there are $n!$
different ordered samples of size $n$, then there are $(n!)x_n$ different ordered
samples of size $n$ in sampling without replacement from the set $S$ of $M$
elements. But we know from the previous example that this number is
$(M)_n$; hence $(n!)x_n = (M)_n$, or

$$x_n = \frac{(M)_n}{n!} = \binom{M}{n} = \text{number of subsets of size } n \text{ that may be formed from the elements of a set of size } M. \tag{1}$$

The total number of subsets of $S$, where $S$ is a set of size $M$, is $\sum_{n=0}^{M} \binom{M}{n}$.
This includes the empty set (set with no elements in it) and the whole set,
both of which are subsets. Using the binomial theorem (see Appendix A),

$$(a + b)^M = \sum_{n=0}^{M} \binom{M}{n} a^n b^{M-n},$$

with $a = b = 1$, we see that

$$2^M = \sum_{n=0}^{M} \binom{M}{n}; \tag{2}$$

thus a set of size $M$ has $2^M$ subsets. ////
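Equations (1) and (2) are easy to confirm numerically. The sketch below is an illustration only; the set `S` and the helper name `subsets_of_size` are assumptions made for the example.

```python
from itertools import combinations
from math import comb, factorial, perm

def subsets_of_size(S, n):
    """All subsets of S having exactly n elements (order ignored)."""
    return list(combinations(S, n))

S = ["a", "b", "c", "d", "e"]  # an arbitrary set with M = 5 elements
M = len(S)
counts = [len(subsets_of_size(S, n)) for n in range(M + 1)]

for n, x_n in enumerate(counts):
    # Eq. (1): x_n = (M)_n / n! = C(M, n)
    print(n, x_n, perm(M, n) // factorial(n), comb(M, n))

# Eq. (2): the subset counts add up to 2**M
print(sum(counts), 2 ** M)
```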


EXAMPLE 20 Suppose an urn contains $M$ balls numbered 1 to $M$, where the
first $K$ balls are defective and the remaining $M - K$ are nondefective.
The experiment is to draw $n$ balls from the urn. Define $A_k$ to be the event
that the sample of $n$ balls contains exactly $k$ defectives. There are two
ways to draw the sample: (i) with replacement and (ii) without replacement.
We are interested in $P[A_k]$ under each method of sampling. Let
our sample space $\Omega = \{(z_1, \ldots, z_n): z_j = \text{number of the ball drawn on the } j\text{th draw}\}$. Now

$$P[A_k] = \frac{N(A_k)}{N(\Omega)}.$$

From Example 18 above, we know $N(\Omega) = M^n$ under (i) and $N(\Omega) = (M)_n$
under (ii). $A_k$ is that subset of $\Omega$ for which exactly $k$ of the $z_j$'s are ball
numbers 1 to $K$ inclusive. These $k$ ball numbers must fall in some subset
of $k$ positions from the total number of $n$ available positions. There are
$\binom{n}{k}$ ways of selecting the $k$ positions for the ball numbers 1 to $K$ inclusive to
fall in. For each of the $\binom{n}{k}$ different positions, there are $K^k (M-K)^{n-k}$
different $n$-tuples for case (i) and $(K)_k (M-K)_{n-k}$ different $n$-tuples for
case (ii). Thus $A_k$ has size $\binom{n}{k} K^k (M-K)^{n-k}$ for case (i) and size
$\binom{n}{k} (K)_k (M-K)_{n-k}$ for case (ii); so,

$$P[A_k] = \frac{\binom{n}{k} K^k (M-K)^{n-k}}{M^n} \tag{3}$$

in sampling with replacement, and

$$P[A_k] = \frac{\binom{n}{k} (K)_k (M-K)_{n-k}}{(M)_n} \tag{4}$$

in sampling without replacement. This latter formula can be rewritten
as

$$P[A_k] = \frac{\binom{K}{k}\binom{M-K}{n-k}}{\binom{M}{n}}. \tag{5}$$

It might be instructive to derive Eq. (5) in another way. Suppose
that our sample space, denoted now by $\Omega'$, is made up of subsets of size
$n$ rather than $n$-tuples; that is, $\Omega' = \{\{z_1, \ldots, z_n\}: z_1, \ldots, z_n \text{ are the numbers on the } n \text{ balls drawn}\}$.
There are $\binom{M}{n}$ subsets of size $n$ of the $M$ balls,
so $N(\Omega') = \binom{M}{n}$. If it is assumed that each of these subsets of size $n$ is
just as likely as any other subset of size $n$ (one can think of selecting all $n$
balls at once rather than one at a time), then $P[A_k'] = N(A_k')/N(\Omega')$. Now
$N(A_k')$ is the size of the event consisting of those subsets of size $n$ which
contain exactly $k$ balls from the balls that are numbered 1 to $K$ inclusive.
The $k$ balls from the balls that are numbered 1 to $K$ can be selected in
$\binom{K}{k}$ ways, and the remaining $n - k$ balls from the balls that are numbered
$K + 1$ to $M$ can be selected in $\binom{M-K}{n-k}$ ways; hence $N(A_k') = \binom{K}{k}\binom{M-K}{n-k}$, and

$$P[A_k'] = \frac{N(A_k')}{N(\Omega')} = \frac{\binom{K}{k}\binom{M-K}{n-k}}{\binom{M}{n}}.$$

We have derived the probability of exactly $k$ defectives in sampling
without replacement by considering two different sample spaces; one
sample space consisted of $n$-tuples, the other consisted of subsets of size
$n$.

To aid in remembering the formula given in Eq. (5), note that
$K + M - K = M$ and $k + n - k = n$; i.e., the sum of the "upper" terms
in the numerator equals the "upper" term in the denominator, and the
sum of the "lower" terms in the numerator equals the "lower" term
in the denominator. ////
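As a check on Example 20, the probability of exactly $k$ defectives can be computed directly as $N(A_k)/N(\Omega)$ by listing every $n$-tuple, and then compared with the closed forms for sampling with replacement, Eq. (3), and without replacement, Eq. (5). This is a minimal sketch under assumed small values of $M$, $K$, and $n$; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product
from math import comb

def p_exact_k_by_enumeration(M, K, n, k, replacement):
    """P[A_k] = N(A_k)/N(Omega), counting n-tuples; balls 1..K are the defectives."""
    balls = range(1, M + 1)
    samples = product(balls, repeat=n) if replacement else permutations(balls, n)
    total = hits = 0
    for z in samples:
        total += 1
        if sum(1 for ball in z if ball <= K) == k:
            hits += 1
    return Fraction(hits, total)

def p_with_replacement(M, K, n, k):
    """Closed form of Eq. (3)."""
    return Fraction(comb(n, k) * K**k * (M - K)**(n - k), M**n)

def p_without_replacement(M, K, n, k):
    """Closed form of Eq. (5)."""
    return Fraction(comb(K, k) * comb(M - K, n - k), comb(M, n))

M, K, n = 6, 2, 3  # illustrative values
for k in range(n + 1):
    print(k,
          p_exact_k_by_enumeration(M, K, n, k, True), p_with_replacement(M, K, n, k),
          p_exact_k_by_enumeration(M, K, n, k, False), p_without_replacement(M, K, n, k))
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the comparison is an equality rather than a floating-point approximation.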


EXAMPLE 21 The formula given in Eq. (5) is particularly useful to calculate
certain probabilities having to do with card games. For example, we
might ask the probability that a certain 13-card hand contains exactly
6 spades. There are $M = 52$ total cards, and one can model the card
shuffling and dealing process by assuming that the 13-card hand represents
a sample of size 13 drawn without replacement from the 52 cards. Let
$A_6$ denote the event of exactly 6 spades. There are a total of 13 spades
(defective balls in sampling terminology); so

$$P[A_6] = \frac{\binom{13}{6}\binom{39}{7}}{\binom{52}{13}}$$

by Eq. (5). ////
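The card probability of Example 21 can be evaluated numerically from Eq. (5). The sketch below is illustrative; the function name `p_exact_k` is an assumption, and `math.comb` supplies the binomial coefficients.

```python
from fractions import Fraction
from math import comb

def p_exact_k(M, K, n, k):
    """Eq. (5): exactly k of the K marked items in a sample of n drawn without replacement from M."""
    return Fraction(comb(K, k) * comb(M - K, n - k), comb(M, n))

# M = 52 cards, K = 13 spades, n = 13 cards in the hand, k = 6 spades
p_six_spades = p_exact_k(52, 13, 13, 6)
print(float(p_six_spades))  # roughly 0.04

# the probabilities over all possible spade counts sum to 1
print(sum(p_exact_k(52, 13, 13, k) for k in range(14)))
```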


Many other formulas for probabilities of specified events defined on finite 
sample spaces with equally likely sample points can be derived using methods of 
combinatorial analysis, but we will not undertake such derivations here. The 
interested reader is referred to Refs. 10 and 8. 

Finite sample space without equally likely points We saw for finite sample
spaces with equally likely sample points that $P[A] = N(A)/N(\Omega)$ for any event $A$.
For finite sample spaces without equally likely sample points, things are not
quite as simple, but we can completely define the values of $P[A]$ for each of the
$2^{N(\Omega)}$ events $A$ by specifying the value of $P[\cdot]$ for each of the $N = N(\Omega)$ elementary
events. Let $\Omega = \{\omega_1, \ldots, \omega_N\}$, and assume $p_j = P[\{\omega_j\}]$ for $j = 1, \ldots, N$.
Since

$$1 = P[\Omega] = P\Big[\bigcup_{j=1}^{N} \{\omega_j\}\Big] = \sum_{j=1}^{N} P[\{\omega_j\}], \qquad \sum_{j=1}^{N} p_j = 1.$$

For any event $A$, define $P[A] = \sum p_j$, where the summation is over those $\omega_j$
belonging to $A$. It can be shown that $P[\cdot]$ so defined satisfies the three axioms
and hence is a probability function.

EXAMPLE 22 Consider an experiment that has $N$ outcomes, say $\omega_1, \omega_2, \ldots, \omega_N$,
where it is known that outcome $\omega_{j+1}$ is twice as likely as outcome
$\omega_j$, where $j = 1, \ldots, N - 1$; that is, $p_{j+1} = 2p_j$, where $p_j = P[\{\omega_j\}]$. Find
$P[A_k]$, where $A_k = \{\omega_1, \omega_2, \ldots, \omega_k\}$. Since

$$\sum_{j=1}^{N} p_j = p_1(1 + 2 + 2^2 + \cdots + 2^{N-1}) = p_1(2^N - 1) = 1,$$

we have $p_1 = 1/(2^N - 1)$ and $p_j = 2^{j-1}/(2^N - 1)$; hence

$$P[A_k] = \sum_{j=1}^{k} p_j = \sum_{j=1}^{k} \frac{2^{j-1}}{2^N - 1} = \frac{2^k - 1}{2^N - 1}.$$

////
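The construction of a probability function from point masses, and the particular case of Example 22, can be sketched as follows. The names `point_masses` and `prob` and the values of `N` and `k` are assumptions chosen for illustration.

```python
from fractions import Fraction

def point_masses(N):
    """p_1, ..., p_N when each outcome is twice as likely as the one before it."""
    weights = [2 ** (j - 1) for j in range(1, N + 1)]
    total = sum(weights)  # equals 2**N - 1
    return [Fraction(w, total) for w in weights]

def prob(event, p):
    """P[A] as the sum of p_j over the (1-based) outcome indices j belonging to A."""
    return sum(p[j - 1] for j in event)

N, k = 6, 3  # illustrative values
p = point_masses(N)
A_k = range(1, k + 1)
print(sum(p))                                     # total mass is 1
print(prob(A_k, p), Fraction(2**k - 1, 2**N - 1)) # both equal (2^k - 1)/(2^N - 1)
```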




Conditional Probability and Independence 

In the application of probability theory to practical problems it is not infrequent
that the experimenter is confronted with the following situation: Such and such
has happened; now what is the probability that something else will happen?
For example, in an experiment of recording the life of a light bulb, one might
be interested in the probability that the bulb will last 100 hours given that it
has already lasted for 24 hours. Or in an experiment of sampling from a box
containing 100 resistors of which 5 are defective, what is the probability that
the third draw results in a defective given that the first two draws resulted in
defectives? Probability questions of this sort are considered in the framework
of conditional probability, the subject that we study next.

Conditional probability We begin by assuming that we have a probability
space, say $(\Omega, \mathscr{A}, P[\cdot])$; that is, we have at hand some random experiment for
which a sample space $\Omega$, collection of events $\mathscr{A}$, and probability function
$P[\cdot]$ have all been defined.


Given two events $A$ and $B$, we want to define the conditional probability
of event $A$ given that event $B$ has occurred.


Definition 18 Conditional probability Let $A$ and $B$ be two events in $\mathscr{A}$
of the given probability space $(\Omega, \mathscr{A}, P[\cdot])$. The conditional probability
of event $A$ given event $B$, denoted by $P[A \mid B]$, is defined by

$$P[A \mid B] = \frac{P[AB]}{P[B]} \quad \text{if } P[B] > 0, \tag{6}$$

and is left undefined if $P[B] = 0$. ////


Remark A formula that is evident from the definition is $P[AB] =
P[A \mid B]P[B] = P[B \mid A]P[A]$ if both $P[A]$ and $P[B]$ are nonzero. This
formula relates $P[A \mid B]$ to $P[B \mid A]$ in terms of the unconditional
probabilities $P[A]$ and $P[B]$. ////


We might note that the above definition is compatible with the frequency
approach to probability, for if one observes a large number, say $N$, of
occurrences of a random experiment for which events $A$ and $B$ are defined, then
$P[A \mid B]$ represents the proportion of occurrences in which $B$ occurred that $A$ also
occurred, that is,

$$P[A \mid B] = \frac{N_{AB}}{N_B},$$

where $N_B$ denotes the number of occurrences of the event $B$ in the $N$
occurrences of the random experiment and $N_{AB}$ denotes the number of occurrences
of the event $A \cap B$ in the $N$ occurrences. Now $P[AB] = N_{AB}/N$, and $P[B] =
N_B/N$; so

$$\frac{P[AB]}{P[B]} = \frac{N_{AB}/N}{N_B/N} = \frac{N_{AB}}{N_B} = P[A \mid B],$$

consistent with our definition.
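The frequency interpretation can be illustrated by simulation. The sketch below is an assumption-laden illustration rather than anything from the text: it tosses two fair dice repeatedly, takes $B$ = {total of at least 10} and $A$ = {first die shows 6}, and reports $N_{AB}/N_B$. By counting sample points, $P[A \mid B] = (3/36)/(6/36) = 1/2$, so the relative frequency should settle near that value.

```python
import random

def conditional_relative_frequency(num_trials, seed=0):
    """Estimate P[A|B] by N_AB / N_B over repeated tosses of two fair dice."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    n_b = n_ab = 0
    for _ in range(num_trials):
        first, second = rng.randint(1, 6), rng.randint(1, 6)
        in_b = first + second >= 10  # event B: total of at least 10
        in_a = first == 6            # event A: first die shows 6
        if in_b:
            n_b += 1
            if in_a:
                n_ab += 1
    return n_ab / n_b

estimate = conditional_relative_frequency(200_000)
print(estimate)  # should be close to 1/2
```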


EXAMPLE 23 Let $\Omega$ be any finite sample space, $\mathscr{A}$ the collection of all subsets
of $\Omega$, and $P[\cdot]$ the equally likely probability function. Write $N = N(\Omega)$.
For events $A$ and $B$,

$$P[A \mid B] = \frac{P[AB]}{P[B]} = \frac{N(AB)/N}{N(B)/N} = \frac{N(AB)}{N(B)},$$

where, as usual, $N(B)$ is the size of set $B$. So for any finite sample space
with equally likely sample points, the values of $P[A \mid B]$ are defined for any
two events $A$ and $B$ provided $P[B] > 0$. ////


EXAMPLE 24 Consider the experiment of tossing two coins. Let $\Omega =
\{(H, H), (H, T), (T, H), (T, T)\}$, and assume that each point is equally
likely. Find (i) the probability of two heads given a head on the first
coin and (ii) the probability of two heads given at least one head. Let
$A_1$ = {head on first coin} and $A_2$ = {head on second coin}; then the
probability of two heads given a head on the first coin is

$$P[A_1A_2 \mid A_1] = \frac{P[A_1A_2A_1]}{P[A_1]} = \frac{P[A_1A_2]}{P[A_1]} = \frac{1/4}{1/2} = \frac{1}{2}.$$

The probability of two heads given at least one head is

$$P[A_1A_2 \mid A_1 \cup A_2] = \frac{P[A_1A_2 \cap (A_1 \cup A_2)]}{P[A_1 \cup A_2]} = \frac{P[A_1A_2]}{P[A_1 \cup A_2]} = \frac{1/4}{3/4} = \frac{1}{3}.$$

We obtained numerical answers to these two questions, but to do so we
had to model the experiment; we assumed that the four sample points
were equally likely.
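For a finite sample space with equally likely points, conditional probabilities reduce to counting, as in Examples 23 and 24. A minimal sketch follows; the helper names `prob` and `cond_prob` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# the four equally likely sample points for two coin tosses
omega = set(product("HT", repeat=2))

def prob(event):
    """Equally likely probability: N(A)/N(Omega)."""
    return Fraction(len(event), len(omega))

def cond_prob(a, b):
    """P[A|B] = P[AB]/P[B]; assumes P[B] > 0."""
    return prob(a & b) / prob(b)

A1 = {w for w in omega if w[0] == "H"}  # head on first coin
A2 = {w for w in omega if w[1] == "H"}  # head on second coin

print(cond_prob(A1 & A2, A1))       # 1/2
print(cond_prob(A1 & A2, A1 | A2))  # 1/3
```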


When speaking of conditional probabilities we are conditioning on some
given event $B$; that is, we are assuming that the experiment has resulted in some
outcome in $B$. $B$, in effect, then becomes our "new" sample space. One question
that might be raised is: For given event $B$ for which $P[B] > 0$, is $P[\cdot \mid B]$ a
probability function having $\mathscr{A}$ as its domain? In other words, does $P[\cdot \mid B]$
satisfy the three axioms? Note that:

(i) $P[A \mid B] = P[AB]/P[B] \geq 0$ for every $A \in \mathscr{A}$.
(ii) $P[\Omega \mid B] = P[\Omega B]/P[B] = P[B]/P[B] = 1$.
(iii) If $A_1, A_2, \ldots$ is a sequence of mutually exclusive events in $\mathscr{A}$ and
$\bigcup_{i=1}^{\infty} A_i \in \mathscr{A}$, then

$$P\Big[\bigcup_{i=1}^{\infty} A_i \,\Big|\, B\Big] = \frac{P\big[\big(\bigcup_{i=1}^{\infty} A_i\big)B\big]}{P[B]} = \frac{P\big[\bigcup_{i=1}^{\infty} (A_iB)\big]}{P[B]} = \frac{\sum_{i=1}^{\infty} P[A_iB]}{P[B]} = \sum_{i=1}^{\infty} P[A_i \mid B].$$

Hence, $P[\cdot \mid B]$ for given $B$ satisfying $P[B] > 0$ is a probability function, which
justifies our calling it a conditional probability. $P[\cdot \mid B]$ also enjoys the same
properties as the unconditional probability. The theorems listed below are
patterned after those in Subsec. 3.4.


Properties of $P[\cdot \mid B]$ Assume that the probability space $(\Omega, \mathscr{A}, P[\cdot])$ is given,
and let $B \in \mathscr{A}$ satisfy $P[B] > 0$.

Theorem 22 $P[\emptyset \mid B] = 0$. ////

Theorem 23 If $A_1, \ldots, A_n$ are mutually exclusive events in $\mathscr{A}$, then

$$P[A_1 \cup \cdots \cup A_n \mid B] = \sum_{i=1}^{n} P[A_i \mid B].$$ ////

Theorem 24 If $A$ is an event in $\mathscr{A}$, then

$$P[\bar{A} \mid B] = 1 - P[A \mid B].$$ ////

Theorem 25 If $A_1$ and $A_2 \in \mathscr{A}$, then

$$P[A_1 \mid B] = P[A_1A_2 \mid B] + P[A_1\bar{A}_2 \mid B].$$ ////

Theorem 26 For every two events $A_1$ and $A_2 \in \mathscr{A}$,

$$P[A_1 \cup A_2 \mid B] = P[A_1 \mid B] + P[A_2 \mid B] - P[A_1A_2 \mid B].$$ ////

Theorem 27 If $A_1$ and $A_2 \in \mathscr{A}$ and $A_1 \subset A_2$, then

$$P[A_1 \mid B] \leq P[A_2 \mid B].$$ ////

Theorem 28 If $A_1, A_2, \ldots, A_n \in \mathscr{A}$, then

$$P[A_1 \cup A_2 \cup \cdots \cup A_n \mid B] \leq \sum_{i=1}^{n} P[A_i \mid B].$$ ////

Proofs of the above theorems follow from known properties of $P[\cdot]$ and
are left as exercises.

There are a number of other useful formulas involving conditional
probabilities that we will state as theorems. These will be followed by examples.


Theorem 29 Theorem of total probabilities For a given probability
space $(\Omega, \mathscr{A}, P[\cdot])$, if $B_1, B_2, \ldots, B_n$ is a collection of mutually disjoint
events in $\mathscr{A}$ satisfying $\Omega = \bigcup_{j=1}^{n} B_j$ and $P[B_j] > 0$ for $j = 1, \ldots, n$, then
for every $A \in \mathscr{A}$,

$$P[A] = \sum_{j=1}^{n} P[A \mid B_j]P[B_j].$$

PROOF Note that $A = \bigcup_{j=1}^{n} AB_j$, and the $AB_j$'s are mutually disjoint;
hence

$$P[A] = P\Big[\bigcup_{j=1}^{n} AB_j\Big] = \sum_{j=1}^{n} P[AB_j] = \sum_{j=1}^{n} P[A \mid B_j]P[B_j].$$ ////


Corollary For a given probability space $(\Omega, \mathscr{A}, P[\cdot])$ let $B \in \mathscr{A}$
satisfy $0 < P[B] < 1$; then for every $A \in \mathscr{A}$

$$P[A] = P[A \mid B]P[B] + P[A \mid \bar{B}]P[\bar{B}].$$ ////


Remark Theorem 29 remains true if $n = \infty$. ////

Theorem 29 (and its corollary) is particularly useful for those experiments
that have stages; that is, the experiment consists of performing first one thing
(first stage) and then another (second stage). Example 25 provides an example
of such an experiment; there, one first selects an urn and then selects a ball
from the selected urn. For such experiments, if $B_j$ is an event defined only in
terms of the first stage and $A$ is an event defined in terms of the second stage,
then it may be easy to find $P[B_j]$; also, it may be easy to find $P[A \mid B_j]$, and then
Theorem 29 evaluates $P[A]$ in terms of $P[B_j]$ and $P[A \mid B_j]$ for $j = 1, \ldots, n$. In
an experiment consisting of stages it is natural to condition on results of a first
stage.


Theorem 30 Bayes' formula For a given probability space $(\Omega, \mathscr{A}, P[\cdot])$,
if $B_1, B_2, \ldots, B_n$ is a collection of mutually disjoint events in $\mathscr{A}$ satisfying
$\Omega = \bigcup_{j=1}^{n} B_j$ and $P[B_j] > 0$ for $j = 1, \ldots, n$, then for every $A \in \mathscr{A}$ for which
$P[A] > 0$

$$P[B_k \mid A] = \frac{P[A \mid B_k]P[B_k]}{\sum_{j=1}^{n} P[A \mid B_j]P[B_j]}.$$

PROOF

$$P[B_k \mid A] = \frac{P[B_kA]}{P[A]} = \frac{P[A \mid B_k]P[B_k]}{\sum_{j=1}^{n} P[A \mid B_j]P[B_j]}$$

by using both the definition of conditional probability and the theorem of
total probabilities. ////


Corollary For a given probability space $(\Omega, \mathscr{A}, P[\cdot])$ let $A$ and $B \in \mathscr{A}$
satisfy $P[A] > 0$ and $0 < P[B] < 1$; then

$$P[B \mid A] = \frac{P[A \mid B]P[B]}{P[A \mid B]P[B] + P[A \mid \bar{B}]P[\bar{B}]}.$$ ////


Remark Theorem 30 remains true if $n = \infty$. ////

As was the case with the theorem of total probabilities, Bayes' formula is
also particularly useful for those experiments consisting of stages. If $B_j$,
$j = 1, \ldots, n$, is an event defined in terms of a first stage and $A$ is an event defined
in terms of the whole experiment including a second stage, then asking for
$P[B_k \mid A]$ is in a sense backward; one is asking for the probability of an event
defined in terms of a first stage of the experiment conditioned on what happens
in a later stage of the experiment. The natural conditioning would be to
condition on what happens in the first stage of the experiment, and this is precisely
what Bayes' formula does; it expresses $P[B_k \mid A]$ in terms of the natural
conditioning given by $P[A \mid B_j]$ and $P[B_j]$, $j = 1, \ldots, n$.

Theorem 31 Multiplication rule For a given probability space
$(\Omega, \mathscr{A}, P[\cdot])$, let $A_1, \ldots, A_n$ be events belonging to $\mathscr{A}$ for which
$P[A_1 \cdots A_{n-1}] > 0$; then

$$P[A_1A_2 \cdots A_n] = P[A_1]P[A_2 \mid A_1]P[A_3 \mid A_1A_2] \cdots P[A_n \mid A_1 \cdots A_{n-1}].$$

PROOF The proof can be attained by employing mathematical
induction and is left as an exercise. ////

As with the two previous theorems, the multiplication rule is primarily
useful for experiments defined in terms of stages. Suppose the experiment has
$n$ stages and $A_j$ is an event defined in terms of stage $j$ of the experiment; then
$P[A_j \mid A_1A_2 \cdots A_{j-1}]$ is the conditional probability of an event described in
terms of what happens on stage $j$ conditioned on what happens on stages
$1, 2, \ldots, j - 1$. The multiplication rule gives $P[A_1A_2 \cdots A_n]$ in terms of
the natural conditional probabilities $P[A_j \mid A_1A_2 \cdots A_{j-1}]$ for $j = 2, \ldots, n$.


EXAMPLE 25 There are five urns, and they are numbered 1 to 5. Each
urn contains 10 balls. Urn $i$ has $i$ defective balls and $10 - i$ nondefective
balls, $i = 1, 2, \ldots, 5$. For instance, urn 3 has three defective balls and
seven nondefective balls. Consider the following random experiment:
First an urn is selected at random, and then a ball is selected at random
from the selected urn. (The experimenter does not know which urn was
selected.) Let us ask two questions: (i) What is the probability that a
defective ball will be selected? (ii) If we have already selected the ball
and noted that it is defective, what is the probability that it came from
urn 5?

SOLUTION Let $A$ denote the event that a defective ball is selected and
$B_i$ the event that urn $i$ is selected, $i = 1, \ldots, 5$. Note that $P[B_i] = \frac{1}{5}$,
$i = 1, \ldots, 5$, and $P[A \mid B_i] = i/10$, $i = 1, \ldots, 5$. Question (i) asks, What is
$P[A]$? Using the theorem of total probabilities, we have

$$P[A] = \sum_{i=1}^{5} P[A \mid B_i]P[B_i] = \sum_{i=1}^{5} \frac{i}{10} \cdot \frac{1}{5} = \frac{1}{50} \sum_{i=1}^{5} i = \frac{1}{50} \cdot \frac{5 \cdot 6}{2} = \frac{3}{10}.$$

Note that there is a total of 50 balls of which 15 are defective! Question
(ii) asks, What is $P[B_5 \mid A]$? Since urn 5 has more defective balls than
any of the other urns and we selected a defective ball, we suspect that
$P[B_5 \mid A] > P[B_i \mid A]$ for $i = 1, 2, 3,$ or 4. In fact, we suspect
$P[B_5 \mid A] > P[B_4 \mid A] > \cdots > P[B_1 \mid A]$. Employing Bayes' formula, we find

$$P[B_5 \mid A] = \frac{P[A \mid B_5]P[B_5]}{\sum_{i=1}^{5} P[A \mid B_i]P[B_i]} = \frac{\frac{5}{10} \cdot \frac{1}{5}}{\frac{3}{10}} = \frac{1}{3}.$$

Similarly,

$$P[B_k \mid A] = \frac{(k/10) \cdot \frac{1}{5}}{\frac{3}{10}} = \frac{k}{15}, \qquad k = 1, \ldots, 5,$$

substantiating our suspicion. Note that unconditionally all the $B_i$'s were
equally likely whereas, conditionally (conditioned on occurrence of event
$A$), they were not. Also, note that

$$\sum_{k=1}^{5} P[B_k \mid A] = \sum_{k=1}^{5} \frac{k}{15} = \frac{1}{15} \cdot \frac{5 \cdot 6}{2} = 1.$$

////
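The two-stage calculation of Example 25 can be sketched in a few lines: the theorem of total probabilities gives $P[A]$, and Bayes' formula then gives each $P[B_k \mid A]$. The variable names below (`prior`, `likelihood`, `posterior`) are illustrative labels and do not appear in the text.

```python
from fractions import Fraction

urns = range(1, 6)
prior = {i: Fraction(1, 5) for i in urns}        # P[B_i]: urn chosen at random
likelihood = {i: Fraction(i, 10) for i in urns}  # P[A|B_i]: urn i has i defectives out of 10

# theorem of total probabilities
p_defective = sum(likelihood[i] * prior[i] for i in urns)

# Bayes' formula for each urn
posterior = {k: likelihood[k] * prior[k] / p_defective for k in urns}

print(p_defective)              # 3/10
print(posterior[5])             # 1/3
print(sum(posterior.values()))  # 1
```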


EXAMPLE 26 Assume that a student is taking a multiple-choice test. On a
given question, the student either knows the answer, in which case he
answers it correctly, or he does not know the answer, in which case he
guesses, hoping to guess the right answer. Assume that there are five
multiple-choice alternatives, as is often the case. The instructor is
confronted with this problem: Having observed that the student got the
correct answer, he wishes to know what is the probability that the student
knew the answer. Let $p$ be the probability that the student will know the
answer and $1 - p$ the probability that the student guesses. Let us assume
that the probability that the student gets the right answer given that he
guesses is $\frac{1}{5}$. (This may not be a realistic assumption since even though the
student does not know the right answer, he often would know that certain
alternatives are wrong, in which case his probability of guessing correctly
should be better than $\frac{1}{5}$.) Let $A$ denote the event that the student got the
right answer and $B$ denote the event that the student knew the right answer.
We are seeking $P[B \mid A]$. Using Bayes' formula, we have

$$P[B \mid A] = \frac{P[A \mid B]P[B]}{P[A \mid B]P[B] + P[A \mid \bar{B}]P[\bar{B}]} = \frac{1 \cdot p}{1 \cdot p + \frac{1}{5}(1 - p)}.$$

Note that

$$\frac{p}{p + \frac{1}{5}(1 - p)} \geq p.$$

////

EXAMPLE 27 An urn contains ten balls of which three are black and seven
are white. The following game is played: At each trial a ball is selected
at random, its color is noted, and it is replaced along with two additional
balls of the same color. What is the probability that a black ball is
selected in each of the first three trials? Let $B_i$ denote the event that a
black ball is selected on the $i$th trial. We are seeking $P[B_1B_2B_3]$. By the
multiplication rule,

$$P[B_1B_2B_3] = P[B_1]P[B_2 \mid B_1]P[B_3 \mid B_1B_2] = \frac{3}{10} \cdot \frac{5}{12} \cdot \frac{7}{14} = \frac{1}{16}.$$ ////
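The calculations in Examples 26 and 27 can be sketched as two small functions: one applies the corollary to Bayes' formula with $P[A \mid B] = 1$, the other applies the multiplication rule stage by stage. Both function names and their parameters are hypothetical, introduced only for this illustration.

```python
from fractions import Fraction

def prob_knew_given_correct(p, alternatives=5):
    """Example 26: P[B|A] with P[A|B] = 1 and P[A|not B] = 1/alternatives."""
    guess = Fraction(1, alternatives)
    return p / (p + guess * (1 - p))

def prob_all_black(black, white, added, trials):
    """Example 27: multiplication rule when each drawn ball is returned with `added` more of its color."""
    result = Fraction(1)
    for _ in range(trials):
        result *= Fraction(black, black + white)  # P[black on this trial | black on all earlier trials]
        black += added                            # the urn gains `added` black balls
    return result

print(prob_knew_given_correct(Fraction(1, 2)))  # 5/6
print(prob_all_black(3, 7, 2, 3))               # 1/16
```

For $p = 1/2$ the first function returns $5/6$, which exceeds $p$, in line with the inequality noted at the end of Example 26.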


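EXAMPLE 28 Suppose an urn contains $M$ balls of which $K$ are black and
$M - K$ are white. A sample of size $n$ is drawn. Find the probability
that the $j$th ball drawn is black given that the sample contains $k$ black
balls. (We intuitively expect the answer to be $k/n$.) We have to
consider sampling (i) with replacement and (ii) without replacement.

SOLUTION Let $A_k$ denote the event that the sample contains exactly
$k$ black balls and $B_j$ denote the event that the $j$th ball drawn is black.
We seek $P[B_j \mid A_k]$. Consider (i) first.

$$P[A_k] = \binom{n}{k} \frac{K^k (M-K)^{n-k}}{M^n}$$

and

$$P[A_k \mid B_j] = \binom{n-1}{k-1} \frac{K^{k-1} (M-K)^{n-k}}{M^{n-1}}$$

by Eq. (3) of Subsec. 3.5. Since the balls are replaced, $P[B_j] = K/M$ for
any $j$. Hence,

$$P[B_j \mid A_k] = \frac{P[A_k \mid B_j]P[B_j]}{P[A_k]} = \frac{\binom{n-1}{k-1} \big[K^{k-1}(M-K)^{n-k}/M^{n-1}\big] (K/M)}{\binom{n}{k} K^k (M-K)^{n-k}/M^n} = \frac{\binom{n-1}{k-1}}{\binom{n}{k}} = \frac{k}{n}.$$

For case (ii),

$$P[A_k] = \frac{\binom{K}{k}\binom{M-K}{n-k}}{\binom{M}{n}} \quad \text{and} \quad P[A_k \mid B_j] = \frac{\binom{K-1}{k-1}\binom{M-K}{n-k}}{\binom{M-1}{n-1}}$$

by Eq. (5) of Subsec. 3.5. $P[B_j] = \sum_{i=0}^{j-1} P[B_j \mid C_i]P[C_i]$, where $C_i$ denotes
the event of exactly $i$ black balls in the first $j - 1$ draws. Note that

$$P[C_i] = \frac{\binom{K}{i}\binom{M-K}{j-1-i}}{\binom{M}{j-1}}$$

and

$$P[B_j \mid C_i] = \frac{K - i}{M - j + 1},$$

and so

$$P[B_j] = \sum_{i=0}^{j-1} \frac{\binom{K}{i}\binom{M-K}{j-1-i}}{\binom{M}{j-1}} \cdot \frac{K - i}{M - j + 1} = \frac{K}{M} \sum_{i=0}^{j-1} \frac{\binom{K-1}{i}\binom{M-K}{j-1-i}}{\binom{M-1}{j-1}} = \frac{K}{M}.$$

Finally,

$$P[B_j \mid A_k] = \frac{P[A_k \mid B_j]P[B_j]}{P[A_k]} = \frac{\dfrac{\binom{K-1}{k-1}\binom{M-K}{n-k}}{\binom{M-1}{n-1}} \cdot \dfrac{K}{M}}{\dfrac{\binom{K}{k}\binom{M-K}{n-k}}{\binom{M}{n}}} = \frac{k}{n}.$$

Thus we obtain the same answer under either method of sampling. ////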
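The conclusion of Example 28 can be verified by direct enumeration of ordered samples for a small urn. This is a sketch under assumed values of $M$, $K$, $n$, and $k$; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product

def prob_jth_black_given_k_black(M, K, n, k, j, replacement):
    """P[B_j | A_k] by counting n-tuples; balls 1..K are black, draws are numbered 1..n."""
    balls = range(1, M + 1)
    samples = product(balls, repeat=n) if replacement else permutations(balls, n)
    n_ak = n_bj_ak = 0
    for z in samples:
        if sum(1 for ball in z if ball <= K) == k:  # sample has exactly k black balls
            n_ak += 1
            if z[j - 1] <= K:                       # and the jth draw is black
                n_bj_ak += 1
    return Fraction(n_bj_ak, n_ak)

M, K, n, k = 6, 3, 3, 2  # illustrative values
for j in range(1, n + 1):
    print(j,
          prob_jth_black_given_k_black(M, K, n, k, j, replacement=True),
          prob_jth_black_given_k_black(M, K, n, k, j, replacement=False))
```

Every printed value should equal $k/n = 2/3$, regardless of $j$ and of the sampling method.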


Independence of events If $P[A \mid B]$ does not depend on event $B$, that is,
$P[A \mid B] = P[A]$, then it would seem natural to say that event $A$ is independent
of event $B$. This is given in the following definition.


Definition 19 Independent events For a given probability space
$(\Omega, \mathscr{A}, P[\cdot])$, let $A$ and $B$ be two events in $\mathscr{A}$. Events $A$ and $B$ are
defined to be independent if and only if any one of the following conditions
is satisfied:

(i) $P[AB] = P[A]P[B]$.

(ii) $P[A \mid B] = P[A]$ if $P[B] > 0$.

(iii) $P[B \mid A] = P[B]$ if $P[A] > 0$. ////

Remark Some authors use "statistically independent," or "stochastically
independent," instead of "independent." ////

To argue the equivalence of the above three conditions, it suffices to show
that (i) implies (ii), (ii) implies (iii), and (iii) implies (i). If $P[AB] = P[A]P[B]$,
then $P[A \mid B] = P[AB]/P[B] = P[A]P[B]/P[B] = P[A]$ for $P[B] > 0$; so (i) implies
(ii). If $P[A \mid B] = P[A]$, then $P[B \mid A] = P[A \mid B]P[B]/P[A] = P[A]P[B]/P[A] =
P[B]$ for $P[A] > 0$ and $P[B] > 0$; so (ii) implies (iii). And if $P[B \mid A] = P[B]$,
then $P[AB] = P[B \mid A]P[A] = P[B]P[A]$ for $P[A] > 0$. Clearly $P[AB] = P[A]P[B]$
if $P[A] = 0$ or $P[B] = 0$.

One might inquire whether all the above conditions are required in the
definition. For instance, does $P[A_1A_2A_3] = P[A_1]P[A_2]P[A_3]$ imply $P[A_1A_2]
= P[A_1]P[A_2]$? Obviously not, since $P[A_1A_2A_3] = P[A_1]P[A_2]P[A_3]$ if $P[A_3]
= 0$, but $P[A_1A_2] \neq P[A_1]P[A_2]$ if $A_1$ and $A_2$ are not independent. Or does
pairwise independence imply independence? Again the answer is negative,
as the following example shows.


EXAMPLE 30 Pairwise independence does not imply independence. Let $A_1$
denote the event of an odd face on the first die, $A_2$ the event of an odd face
on the second die, and $A_3$ the event of an odd total in the random experiment
that consists of tossing two dice. $P[A_1]P[A_2] = \frac{1}{2} \cdot \frac{1}{2} = P[A_1A_2]$,
$P[A_1]P[A_3] = \frac{1}{2} \cdot \frac{1}{2} = P[A_3 \mid A_1]P[A_1] = P[A_1A_3]$, and $P[A_2A_3] = \frac{1}{4} =
P[A_2]P[A_3]$; so $A_1$, $A_2$, and $A_3$ are pairwise independent. However,
$P[A_1A_2A_3] = 0 \neq \frac{1}{8} = P[A_1]P[A_2]P[A_3]$; so $A_1$, $A_2$, and $A_3$ are not
independent. ////
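Example 30 can be checked by enumerating the 36 equally likely outcomes of two dice. The sketch below is illustrative; `prob`, `pairwise`, and `mutual` are hypothetical names.

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def prob(event):
    """Equally likely probability: N(A)/N(Omega)."""
    return Fraction(len(event), len(omega))

A1 = {w for w in omega if w[0] % 2 == 1}           # odd face on first die
A2 = {w for w in omega if w[1] % 2 == 1}           # odd face on second die
A3 = {w for w in omega if (w[0] + w[1]) % 2 == 1}  # odd total

# product rule for each pair of events
pairwise = all(prob(x & y) == prob(x) * prob(y)
               for x, y in [(A1, A2), (A1, A3), (A2, A3)])
# product rule for all three events together
mutual = prob(A1 & A2 & A3) == prob(A1) * prob(A2) * prob(A3)

print(pairwise, mutual)  # True False
```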


In one sense, independence and conditional probability are each used to find
the same thing, namely, $P[AB]$, for $P[AB] = P[A]P[B]$ under independence and
$P[AB] = P[A \mid B]P[B]$ under nonindependence. The nature of the events $A$ and
$B$ may make calculations of $P[A]$, $P[B]$, and possibly $P[A \mid B]$ easy, but direct
calculation of $P[AB]$ difficult, in which case our formulas for independence or
conditional probability would allow us to avoid the difficult direct calculation
of $P[AB]$. We might note that $P[AB] = P[A \mid B]P[B]$ is valid whether or not $A$
is independent of $B$ provided that $P[A \mid B]$ is defined.

The definition of independence is used not only to check if two given events
are independent but also to model experiments. For instance, for a given
experiment the nature of the events $A$ and $B$ might be such that we are willing
to assume that $A$ and $B$ are independent; then the definition of independence gives
the probability of the event $A \cap B$ in terms of $P[A]$ and $P[B]$. Similarly for
more than two events.


EXAMPLE 31 Consider the experiment of sampling with replacement from
an urn containing $M$ balls of which $K$ are black and $M - K$ white. Since
balls are being replaced after each draw, it seems reasonable to assume that
the outcome of the second draw is independent of the outcome of the
first. Then P[two blacks in first two draws] =
P[black on first draw]P[black on second draw] = $(K/M)^2$. ////


PROBLEMS

To solve some of these problems it may be necessary to make certain assumptions, 
such as sample points are equally likely, or trials are independent, etc., when such 
assumptions are not explicitly stated. Some of the more difficult problems, or those 
that require special knowledge, are marked with an *. 

1 One urn contains one black ball and one gold ball. A second urn contains one 
white and one gold ball. One ball is selected at random from each urn. 

(a) Exhibit a sample space for this experiment. 

( b ) Exhibit the event space. 

(c) What is the probability that both balls will be of the same color? 

(d) What is the probability that one ball will be green? 

2 One urn contains three red balls, two white balls, and one blue ball. A second 
urn contains one red ball, two white balls, and three blue balls. 

(a) One ball is selected at random from each urn. 

(i) Describe a sample space for this experiment. 

(ii) Find the probability that both balls will be of the same color. 

(iii) Is the probability that both balls will be red greater than the prob- 
ability that both will be white? 

(b) The balls in the two urns are mixed together in a single urn, and then a sample 
of three is drawn. Find the probability that all three colors are represented, 
when (i) sampling with replacement and (ii) without replacement. 

3 If $A$ and $B$ are disjoint events, $P[A] = .5$, and $P[A \cup B] = .6$, what is $P[B]$?

4 An urn contains five balls numbered 1 to 5 of which the first three are black and 
the last two are gold. A sample of size 2 is drawn with replacement. Let B x 
denote the event that the first ball drawn is black and B 2 denote the event that the 
second ball drawn is black. 

(a) Describe a sample space for the experiment, and exhibit the events $B_1$, $B_2$,
and $B_1B_2$.

(b) Find $P[B_1]$, $P[B_2]$, and $P[B_1B_2]$.

(c) Repeat parts (a) and ( b ) for sampling without replacement. 

5 A car with six spark plugs is known to have two malfunctioning spark plugs. 
If two plugs are pulled at random, what is the probability of getting both of 
the malfunctioning plugs? 

6 In an assembly-line operation, i of the items being produced are defective. If 
three items are picked at random and tested, what is the probability: 

(a) That exactly one of them will be defective? 

(b) That at least one of them will be defective? 

7 In a certain game a participant is allowed three attempts at scoring a hit. In the
three attempts he must alternate which hand is used; thus he has two possible
strategies: right hand, left hand, right hand; or left hand, right hand, left hand.
His chance of scoring a hit with his right hand is .8, while it is only .5 with his
left hand. If he is successful at the game provided that he scores at least two hits
in a row, what strategy gives the better chance of success? Answer the same
question if .8 is replaced by $p_1$ and .5 by $p_2$. Does your answer depend on $p_1$
and $p_2$?

8 (a) Suppose that $A$ and $B$ are two equally strong teams. Is it more probable
that $A$ will beat $B$ in three games out of four or in five games out of seven?

(b) Suppose now that the probability that $A$ beats $B$ in an individual game is $p$.
Answer part (a). Does your answer depend on $p$?

9 If F{A\ = \ and FIS] = i can A and B be disjoint*? Explain 

10 Prove or disprove: If $P[A] = P[B] = p$, then $P[AB] \leq p^2$.

11 Prove or disprove: If $P[A] = P[\bar{B}]$, then $\bar{A} = B$.

12 Prove or disprove: If $P[A] = 0$, then $A = \emptyset$.

13 Prove or disprove: If $P[A] = 0$, then $P[AB] = 0$.

14 Prove: If $P[\bar{A}] = \alpha$ and $P[\bar{B}] = \beta$, then $P[AB] \geq 1 - \alpha - \beta$.

15 Prove properties (i) to (iv) of indicator functions 

16 Prove the more general statement in Theorem 19 

17 Exhibit (if such exists) a probability space, denoted by $(\Omega, \mathscr{A}, P[\cdot])$, which satisfies
the following: For $A_1$ and $A_2$ members of $\mathscr{A}$, if $P[A_1] = P[A_2]$, then $A_1 = A_2$.

18 Four drinkers (say I, II, III, and IV) are to rank three different brands of beer
(say A, B, and C) in a blindfold test. Each drinker ranks the three beers as 1
(for the beer he likes best), 2, and 3, and then the assigned ranks of each brand
of beer are summed. Assume that the drinkers really cannot discriminate between
beers so that each is assigning his rankings at random.

(a) What is the probability that beer A will receive a total score of 4?

(b) What is the probability that some beer will receive a total score of 4?

(c) What is the probability that some beer will receive a total score of 5 or less?

19 The following are three of the classical problems in probability.

(a) Compare the probability of a total of 9 with a total of 10 when three fair
dice are tossed once (Galileo and Duke of Tuscany).

(b) Compare the probability of at least one 6 in 4 tosses of a fair die with the
probability of at least one double 6 in 24 tosses of two fair dice (Chevalier
de Méré).

(c) Compare the probability of at least one 6 when six dice are rolled with the
probability of at least two 6s when twelve dice are rolled (Pepys to Newton).

20 A seller has a dozen small electric motors, two of which are faulty. A customer is
interested in the dozen motors. The seller can crate the motors with all twelve in
one box or with six in each of two boxes; he knows that the customer will inspect
two of the twelve motors if they are all crated in one box and one motor from each
of the two smaller boxes if they are crated six each to two smaller boxes. He
has three strategies in his attempt to sell the faulty motors: (i) crate all twelve
in one box; (ii) put one faulty motor in each of the two smaller boxes; or (iii) put
both of the faulty motors in one of the smaller boxes and no faulty motors in the
other. What is the probability that the customer will not inspect a faulty motor
under each of the three strategies?

21 A sample of five objects is drawn from a larger population of $N$ objects ($N \geq 5$).
Let $N_w$ or $N_{wo}$ denote the number of different samples that could be drawn
depending, respectively, on whether sampling is done with or without replacement.
Give the values for $N_w$ and $N_{wo}$. Show that when $N$ is very large, these two values
are approximately equal in the sense that their ratio is close to 1 but not in the
sense that their difference is close to 0.

22 Out of a group of 25 persons, what is the probability that all 25 will have different 
birthdays? (Assume a 365-day year and that all days are equally likely.) 

23 A bridge player knows that his two opponents have exactly five hearts between 
the two of them. Each opponent has thirteen cards. What is the probability 
that there is a three-two split on the hearts (that is, one player has three hearts
and the other two)? 

24 (a) If $r$ balls are randomly placed into $n$ urns (each ball having probability $1/n$
of going into the first urn), what is the probability that the first urn will
contain exactly $k$ balls?

(b) Let $n \to \infty$ and $r \to \infty$ while $r/n = m$ remains constant. Show that the
probability you calculated approaches $e^{-m} m^k / k!$.

25 A biased coin has probability p of landing heads. Ace, Bones, and Clod toss the 
coin successively, Ace tossing first, until a head occurs. The person who tosses 
the first head wins. Find the probability of winning for each. 

*26 It is told that in certain rural areas of Russia marital fortunes were once told in the 
following way: A girl would hold six strings in her hand with the ends protruding 
above and below; a friend would tie together the six upper ends in pairs and then 
tie together the six lower ends in pairs. If it turned out that the friend had tied 
the six strings into at least one ring, this was supposed to indicate that the girl 
would get married within a year. What is the probability that a single ring will 
be formed when the strings are tied at random? What is the probability that at 
least one ring will be formed? Generalize the problem to 2n strings.

27 Mr. Bandit, a well-known rancher and not so well-known part-time cattle rustler, 
has twenty head of cattle ready for market. Sixteen of these cattle are his own 
and consequently bear his own brand. The other four bear foreign brands. Mr. 
Bandit knows that the brand inspector at the market place checks the brands of 
20 percent of the cattle in any shipment. He has two trucks, one which will haul 
all twenty cattle at once and the other that will haul ten at a time. Mr. Bandit
feels that he has four different strategies to follow in his attempt to market the 
cattle without getting caught. The first is to sell all twenty head at once; the 
others are to sell ten head on two different occasions, putting all four stolen cattle 
in one set of ten, or three head in one shipment and one in the other, or two head in
each of the shipments of ten. Which strategy will minimize Mr. Bandit’s prob- 
ability of getting caught, and what is his probability of getting caught under each 
strategy? 

28 Show that the formula of Eq. (4) is the same as the formula of Eq. (5). 

29 Prove Theorem 31.

30 Either prove or disprove each of the following (you may assume that none of the
events has zero probability):

(a) If P[A | B] > P[A], then P[B | A] > P[B].

(b) If P[A] > P[B], then P[A | C] > P[B | C].

31 A certain computer program will operate using either of two subroutines, say
A and B, depending on the problem; experience has shown that subroutine A will be
used 40 percent of the time and B will be used 60 percent of the time. If A is
used, then there is a 75 percent probability that the program will run before its
time limit is exceeded, and if B is used, there is a 50 percent chance that it will do
so. What is the probability that the program will run without exceeding the time
limit?

32 Suppose that it is known that a fraction .001 of the people in a town have tuber-
culosis (TB). A tuberculosis test is given with the following properties: If the
person does have TB, the test will indicate it with a probability .999. If he does
not have TB, then there is a probability .002 that the test will erroneously indicate
that he does. For one randomly selected person, the test shows that he has
TB. What is the probability that he really does?

*33 Consider the experiment of tossing two fair regular tetrahedra (a polyhedron with
four faces numbered 1 to 4) and noting the numbers on the downturned faces.

(a) Give three proper events (an event A is proper if 0 < P[A] < 1) which are
independent (if such exist).

(b) Give three proper events which are pairwise independent but not independent
(if such exist).

(c) Give four proper events which are independent (if such exist).

34 Prove or disprove:

(a) If A and B are independent events, then P[AB | C] = P[A | C]P[B | C].

(b) If P[A | B] = P[B], then A and B are independent.

35 Prove or disprove:

(a) If $P[A \mid B] \ge P[A]$, then $P[B \mid A] \ge P[B]$.

(b) If $P[B \mid \bar{A}] = P[B \mid A]$, then A and B are independent.

(c) If a = P[A] and b = P[B], then $P[A \mid B] \ge (a + b - 1)/b$.

36 Consider an urn containing 10 balls of which 5 are black. Choose an integer n
at random from the set {1, 2, 3, 4, 5, 6}, and then choose a sample of size n without
replacement from the urn. Find the probability that all the balls in the sample
will be black.

37 A die is thrown as long as necessary for a 6 to turn up. Given that the 6 does not
turn up at the first throw, what is the probability that more than four throws will
be necessary?

38 Die A has four red and two blue faces, and die B has two red and four blue faces.
The following game is played: First a coin is tossed once. If it falls heads, the
game continues by repeatedly throwing die A; if it falls tails, die B is repeatedly
tossed.

(a) Show that the probability of red at any throw is 1/2.

(b) If the first two throws of the die resulted in red, what is the probability of red 
at the third throw? 

(c) If red turns up at the first n throws, what is the probability that die A is 
being used? 

39 Urn A contains two white and two black balls; urn B contains three white and 
two black balls. One ball is transferred from A to B; one ball is then drawn 
from B and turns out to be white. What is the probability that the transferred 
ball was white? 

*40 It is known that each of four people A, B, C, and D tells the truth in a given
instance with probability 1/3. Suppose that A makes a statement, and then D says
that C says that B says that A was telling the truth. What is the probability 
that A was actually telling the truth? 

41 In a T maze, a laboratory animal is given a choice of going to the left and getting 
food or going to the right and receiving a mild electric shock. Assume that 
before any conditioning (in trial number 1) animals are equally likely to go to the
left or to the right. After having received food on a particular trial, the prob-
abilities of going to the left and right become .6 and .4, respectively, on the follow-
ing trial. However, after receiving a shock on a particular trial, the probabilities
of going to the left and right on the next trial are .8 and .2, respectively. What 
is the probability that the animal will turn left on trial number 2? On trial 
number 3? 

*42 In a breeding experiment, the male parent is known to have either two dominant
genes (symbolized by AA) or one dominant and one recessive (Aa). These two
cases are equally likely. The female parent is known to have two recessive genes 
(aa). Since the offspring gets one gene from each parent, it will be either Aa or 
aa, and it will be possible to say with certainty which one. 

(a) If we suppose one offspring is Aa, what is the probability that the male
parent is AA?

(b) If we suppose two offspring are both Aa, what is the probability that the
male parent is AA?

(c) If one offspring is aa, what is the probability that the male parent is Aa?

43 The constitution of two urns is

        Urn I          Urn II
        three black    four black
        two white      six white

A draw is made by selecting an urn by a process which assigns probability p to the 
selection of urn I and probability 1 - p to the selection of urn II. The selection 
of a ball from either urn is by a process which assigns equal probability to all 
balls in the urn. What value of p makes the probability of obtaining a black 
ball the same as if a single draw were made from an urn with seven black and 
eight white balls (all balls equally probable of being drawn)? 

44 Given P[A] = .5 and P[A ∪ B] = .6, find P[B] if:

(a) A and B are mutually exclusive.

(b) A and B are independent.

(c) P[A | B] = .4.

45 Three fair dice are thrown once. Given that no two show the same face:

(a) What is the probability that the sum of the faces is 7?

(b) What is the probability that one is an ace?

46 Given that P[A] > 0 and P[B] > 0, prove or disprove:

(a) If P[A] = P[B], then P[A | B] = P[B | A].

(b) If P[A | B] = P[B | A], then P[A] = P[B].

47 Five percent of the people have high blood pressure. Of the people with high
blood pressure, 75 percent drink alcohol, whereas only 50 percent of the people
without high blood pressure drink alcohol. What percent of the drinkers have
high blood pressure?

48 A distributor of watermelon seeds determined from extensive tests that 4 percent
of a large batch of seeds will not germinate. He sells the seeds in packages of 50
seeds and guarantees at least 90 percent germination. What is the probability
that a given package will violate the guarantee?

49 If A and B are independent. P[A ) » i, and F[E] *= l. find P[A u B] 

50 Mr. Stoneguy, a wealthy diamond dealer, decides to reward his son by allowing
him to select one of two boxes. Each box contains three stones. In one box two
of the stones are real diamonds, and the other is a worthless imitation; and in the
other box one is a real diamond, and the other two are worthless imitations. If the
son were to choose randomly between the two boxes, his chance of getting two
real diamonds would be 1/2. Mr. Stoneguy, being a sporting type, allows his
son to draw one stone from one of the boxes and to examine it to see if it is a real
diamond. The son decides to take the box that the stone he tested came from if
the tested stone is real and to take the other box otherwise. Now what is the
probability that the son will get two real diamonds?

51 If PM) -P{8]~PIB\A) ■= i, arc A and XI independent? 

52 Jf A and B are independent and F[/4] «= FIS) « J, what is P [A 8 u j?B)? 

53 If FIS) - 1, what is P[ABCtf 

54 If A and J? are independent and P[A] — P[B\A) ■* ), what is FI/4 u 5)? 

55 Suppose S,, Si, and Sj arc mutually exclusive If FIB,] ~ } and P[A\Bi1 “ Jib 
for; -5, 2, 3, what is FJ/4]? 

*56 The game of craps is played by letting the thrower toss two dice until he either wins
or loses. The thrower wins on the first toss if he gets a total of 7 or 11; he loses
on the first toss if he gets a total of 2, 3, or 12. If he gets any other total on his
first toss, that total is called his point. He then tosses the dice repeatedly until he
obtains a total of 7 or his point. He wins if he gets his point and loses if he gets a
total of 7. What is the thrower's probability of winning?

57 In a dice game a player casts a pair of dice twice. He wins if the two totals
thrown do not differ by more than 2, with the following exceptions: If he gets a
3 on the first throw, he must produce a 4 on the second throw; if he gets an 11 on
the first throw, he must produce a 10 on the second throw. What is his prob-
ability of winning?

58 Assume that the conditional probability that a child born to a couple will be
male is $\tfrac{1}{2} + m\varepsilon_1 - f\varepsilon_2$, where $\varepsilon_1$ and $\varepsilon_2$ are certain small constants, m is the
number of male children already born to the couple, and f is the number of female
children already born to the couple.

(a) What is the probability that the third child will be a boy given that the first 
two are girls? 

(b) Find the probability that the first three children will be all boys. 

(c) Find the probability of at least one boy in the first three children. 

(Your answers will be expressed in terms of $\varepsilon_1$ and $\varepsilon_2$.)

*59 A network of switches a, b, c, and d is connected across the power lines A and B
as shown in the sketch. Assume that the switches operate electrically and have 
independent operating mechanisms. All are controlled simultaneously by the 
same impulses; that is, it is intended that on an impulse all switches shall close 
simultaneously. But each switch has a probability p of failure (it will not close 
when it should). 


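[Sketch: network of switches a, b, c, and d connected across the power lines A and B, with the point e marked]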


(a) What is the probability that the circuit from A to B will fail to close? 

(b) If a line is added on at e, as indicated in the sketch, what is the probability 
that the circuit from A to B will fail to close? 

(c) If a line and switch are added at e, what is the probability that the circuit from
A to B will fail to close? 

60 Let $B_1, B_2, \ldots, B_n$ be mutually disjoint, and let $B = \bigcup_{j=1}^{n} B_j$. Suppose $P[B_j] > 0$
and $P[A \mid B_j] = p$ for $j = 1, \ldots, n$. Show that $P[A \mid B] = p$.

61 In a laboratory experiment, an attempt is made to teach an animal to turn right 
in a maze. To aid in the teaching, the animal is rewarded if it turns right on a 
given trial and punished if it turns left. On the first trial the animal is just as 
likely to turn right as left. If on a particular trial the animal was rewarded, his 
probability of turning right on the next trial is $p_1 > 1/2$, and if on a given trial the
animal was punished, his probability of turning right on the next trial is $p_2 > p_1$.

(a) What is the probability that the animal will turn right on the third trial?

(b) What is the probability that the animal will turn right on the third trial,
given that he turned right on the first trial?

*62 You are to play ticktacktoe with an opponent who on his turn makes his mark by
selecting a space at random from the unfilled spaces. You get to mark first.
Where should you mark to maximize your chance of winning, and what is your
probability of winning? (Note that your opponent cannot win; he can only
tie.)

63 Urns I and II each contain two white and two black balls. One ball is selected
from urn I and transferred to urn II; then one ball is drawn from urn II and turns
out to be white. What is the probability that the transferred ball was white?

64 Two regular tetrahedra with faces numbered 1 to 4 are tossed repeatedly until a
total of 5 appears on the down faces. What is the probability that more than two
tosses are required?

65 Given P[A] = .5 and P[A ∪ B] = .7:

(a) Find P[B] if A and B are independent.

(b) Find P[B] if A and B are mutually exclusive.

(c) Find P[B] if P[A | B] = .5.

66 A single die is tossed; then n coins are tossed, where n is the number shown on the
die. What is the probability of exactly two heads?

*67 In simple Mendelian inheritance, a physical characteristic of a plant or animal is
determined by a single pair of genes. The color of peas is an example. Let y and
g represent yellow and green; peas will be green if the plant has the color-gene
pair (g, g); they will be yellow if the color-gene pair is (y, y) or (y, g). In view of
this last combination, yellow is said to be dominant to green. Progeny get one
gene from each parent and are equally likely to get either gene from each parent's
pair. If (y, y) peas are crossed with (g, g) peas, all the resulting peas will be (y, g)
and yellow because of dominance. If (y, g) peas are crossed with (g, g) peas, the
probability is .5 that the resulting peas will be yellow and is .5 that they will be
green. In a large number of such crosses one would expect about half the result-
ing peas to be yellow, the remainder to be green. In crosses between (y, g) and
(y, g) peas, what proportion would be expected to be yellow? What proportion
of the yellow peas would be expected to be (y, y)?

*68 Peas may be smooth or wrinkled, and this is a simple Mendelian character.
Smooth is dominant to wrinkled so that (s, s) and (s, w) peas are smooth while
(w, w) peas are wrinkled. If (y, g)(s, w) peas are crossed with (g, g)(w, w) peas,
what are the possible outcomes, and what are their associated probabilities? For
the (y, g)(s, w) by (g, g)(s, w) cross? For the (y, g)(s, w) by (y, g)(s, w) cross?

69 Prove the two unproven parts of Theorem 32.

70 A supplier of a certain testing device claims that his device has high reliability
inasmuch as $P[A \mid B] = P[\bar{A} \mid \bar{B}] = .95$, where A = {device indicates component is
faulty} and B = {component is faulty}. You hope to use the device to locate the
faulty components in a large batch of components of which 5 percent are faulty.

(a) What is $P[B \mid A]$?

(b) Suppose you want $P[B \mid A] = .9$. Let $p = P[A \mid B] = P[\bar{A} \mid \bar{B}]$. How large
does p have to be?

RANDOM VARIABLES, DISTRIBUTION
FUNCTIONS, AND EXPECTATION 


1 INTRODUCTION AND SUMMARY 

The purpose of this chapter is to introduce the concepts of random variable,
distribution and density functions, and expectation. It is primarily a “definitions-
and-their-understanding” chapter, although some other results are given as
well. The definitions of random variable and cumulative distribution function
are given in Sec. 2, and the definitions of density functions are given in Sec. 3. 
These definitions are easily stated since each is just a particular function. The 
cumulative distribution function exists and is defined for each random variable; 
whereas, a density function is defined only for particular random variables. 
Expectations of functions of random variables are the underlying concept of all 
of Sec. 4. This concept is introduced by considering two particular, yet 
extremely important, expectations. These two are the mean and variance, 
defined in Subsecs. 4.1 and 4.2, respectively. Subsection 4.3 is devoted to the 
definition and properties of expectation of a function of a random variable. 
A very important result in the chapter appears in Subsec. 4.4 as the Chebyshev 
inequality and a generalization thereof. It is nice to be able to attain so famous 
a result so soon and with so little weaponry. The Jensen inequality is given in 
Subsec. 4.5. Moments and moment generating functions, which are expecta-
tions of particular functions, are considered in the final subsection. One major
unproven result, that of the uniqueness of the moment generating function, is
given there. Also included is a brief discussion of some measures of some
characteristics, such as location and dispersion, of distribution or density
functions.

This chapter provides an introduction to the language of distribution
theory. Only the univariate case is considered; the bivariate and multivariate
cases will be considered in Chap. IV. It serves as a preface to, or even as a
companion to, Chap. III, where a number of parametric families of distribution
functions is presented. Chapter III gives many examples of the concepts
defined in Chap. II.


2 RANDOM VARIABLE AND CUMULATIVE
DISTRIBUTION FUNCTION


2.1 Introduction

In Chap. I we defined what we meant by a probability space, which we denoted
by the triplet $(\Omega, \mathscr{A}, P[\cdot])$. We started with a conceptual random experiment;
we called the totality of possible outcomes of this experiment the sample space
and denoted it by $\Omega$. $\mathscr{A}$ was used to denote a collection of subsets, called
events, of the sample space. Finally our probability function $P[\cdot]$ was a set
function having domain $\mathscr{A}$ and counterdomain the interval [0, 1]. Our object
was, and still is, to assess probabilities of events. In other words, we want to
model our random experiment so as to be able to give values to the probabilities
of events. The notion of random variable, to be defined presently, will be used
to describe events, and a cumulative distribution function will be used to give the
probabilities of certain events defined in terms of random variables; so both
concepts will assist us in defining probabilities of events, our goal. One advan-
tage that a cumulative distribution function will have over its counterpart, the
probability function (they both give probabilities of events), is that it is a
function with domain the real line and counterdomain the interval [0, 1]. Thus
we will be able to graph it. It will become a convenient tool in modeling
random experiments. In fact, we will often model a random experiment by
assuming certain things about a random variable and its distribution function
and in so doing completely bypass describing the probability space.

2.2 Definitions

We commence by defining a random variable. 

Definition 1 Random variable For a given probability space $(\Omega, \mathscr{A}, P[\cdot])$,
a random variable, denoted by X or $X(\cdot)$, is a function with
domain $\Omega$ and counterdomain the real line. The function $X(\cdot)$ must be
such that the set $A_r$, defined by $A_r = \{\omega: X(\omega) \le r\}$, belongs to $\mathscr{A}$ for every
real number r. ////

If one thinks in terms of a random experiment, $\Omega$ is the totality of out-
comes of that random experiment, and the function, or random variable, $X(\cdot)$
with domain $\Omega$ makes some real number correspond to each outcome of the
experiment. That is the important part of our definition. The fact that we
also require the collection of $\omega$'s for which $X(\omega) \le r$ to be an event (i.e., an element
of $\mathscr{A}$) for each real number r is not much of a restriction for our purposes
since our intention is to use the notion of random variable only in describing
events. We will seldom be interested in a random variable per se; rather we
will be interested in events defined in terms of random variables. One might
note that the $P[\cdot]$ of our probability space $(\Omega, \mathscr{A}, P[\cdot])$ is not used in our
definition.

The use of words “random” and “variable” in the above definition is 
unfortunate since their use cannot be convincingly justified. The expression 
“random variable” is a misnomer that has gained such widespread use that it 
would be foolish for us to try to rename it. 

In our definition we denoted a random variable by either $X(\cdot)$ or X.
Although $X(\cdot)$ is a more complete notation, one that emphasizes that a random
variable is a function, we will usually use the shorter notation of X. For many 
experiments, there is a need to define more than one random variable; hence 
further notations are necessary. We will try to use capital Latin letters with or 
without affixes from near the end of the alphabet to denote random variables. 
Also, we use the corresponding small letter to denote a value of the random 
variable. 


EXAMPLE 1 Consider the experiment of tossing a single coin. Let the
random variable X denote the number of heads. $\Omega$ = {head, tail}, and
$X(\omega) = 1$ if $\omega$ = head, and $X(\omega) = 0$ if $\omega$ = tail; so, the random variable X
associates a real number with each outcome of the experiment. We
called X a random variable, so mathematically speaking we should show
that it satisfies the definition; that is, we should show that $\{\omega: X(\omega) \le r\}$
belongs to $\mathscr{A}$ for every real number r. $\mathscr{A}$ consists of the four subsets
$\phi$, {head}, {tail}, and $\Omega$. Now, if r < 0, $\{\omega: X(\omega) \le r\} = \phi$; and if
$0 \le r < 1$, $\{\omega: X(\omega) \le r\}$ = {tail}; and if $r \ge 1$, $\{\omega: X(\omega) \le r\} = \Omega$ = {head,
tail}. Hence, for each r the set $\{\omega: X(\omega) \le r\}$ belongs to $\mathscr{A}$; so $X(\cdot)$ is a
random variable. ////

[FIGURE 1: the 36 sample points (i, j), i = 1, ..., 6 and j = 1, ..., 6, of the two-dice experiment]



EXAMPLE 2 Consider the experiment of tossing two dice. $\Omega$ can be de-
scribed by the 36 points displayed in Fig. 1. $\Omega = \{(i, j): i = 1, \ldots, 6$ and
$j = 1, \ldots, 6\}$. Several random variables can be defined; for instance, let
X denote the sum of the upturned faces, so $X(\omega) = i + j$ if $\omega = (i, j)$. Also,
let Y denote the absolute difference between the upturned faces; then
$Y(\omega) = |i - j|$ if $\omega = (i, j)$. It can be shown that both X and Y are ran-
dom variables. We see that X can take on the values 2, 3, ..., 12 and Y
can take on the values 0, 1, ..., 5. ////
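
The two-dice example can be mirrored in a few lines of code. The following is only an illustrative sketch, not part of the original text: the names `omega`, `X`, `Y`, and `event_at_most` are hypothetical, and the sample space is represented as a plain list of ordered pairs.

```python
from itertools import product

# Sample space for tossing two dice: the 36 ordered pairs (i, j).
omega = list(product(range(1, 7), repeat=2))


def X(w):
    """Sum of the upturned faces for the outcome w = (i, j)."""
    i, j = w
    return i + j


def Y(w):
    """Absolute difference between the upturned faces for w = (i, j)."""
    i, j = w
    return abs(i - j)


def event_at_most(rv, r):
    """The event {w : rv(w) <= r}, returned as a set of outcomes."""
    return {w for w in omega if rv(w) <= r}


values_X = sorted({X(w) for w in omega})  # values taken on by X
values_Y = sorted({Y(w) for w in omega})  # values taken on by Y
print(values_X)
print(values_Y)
print(sorted(event_at_most(X, 4)))
```

Here a random variable is literally a function on the outcomes, and `event_at_most` builds the kind of set that Definition 1 requires to be an event.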

In both of the above examples we described the random variables in terms
of the random experiment rather than in specifying their functional form; such
will usually be the case.

Definition 2 Cumulative distribution function The cumulative distribution
function of a random variable X, denoted by $F_X(\cdot)$, is defined to be
that function with domain the real line and counterdomain the interval
[0, 1] which satisfies $F_X(x) = P[X \le x] = P[\{\omega: X(\omega) \le x\}]$ for every real
number x. ////

A cumulative distribution function is uniquely defined for each random 
variable. If it is known, it can be used to find probabilities of events defined 
in terms of its corresponding random variable. (One might note that it is in 
this definition that we use the requirement that $\{\omega: X(\omega) \le r\}$ belong to $\mathscr{A}$ for
every real r which appears in our definition of random variable X.) Note that
different random variables can have the same cumulative distribution function. 
See Example 4 below. 

The use of each of the three words in the expression “cumulative distri- 
bution function” is justifiable. A cumulative distribution function is first of 
all a function ; it is a distribution function inasmuch as it tells us how the values 
of the random variable are distributed, and it is a cumulative distribution func-
tion since it gives the distribution of values in cumulative form. Many writers
omit the word “cumulative” in this definition. Examples and properties of 
cumulative distribution functions follow. 


EXAMPLE 3 Consider again the experiment of tossing a single coin. Assume
that the coin is fair. Let X denote the number of heads. Then,

$$F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ \tfrac{1}{2} & \text{if } 0 \le x < 1 \\ 1 & \text{if } 1 \le x. \end{cases}$$

Or $F_X(x) = \tfrac{1}{2} I_{[0,1)}(x) + I_{[1,\infty)}(x)$ in our indicator function notation. ////


EXAMPLE 4 In the experiment of tossing two fair dice, let Y denote the
absolute difference. The cumulative distribution of Y, $F_Y(\cdot)$, is sketched
in Fig. 2. Also, let $X_k$ denote the value on the upturned face of the kth
die for k = 1, 2. $X_1$ and $X_2$ are different random variables, yet both
have the same cumulative distribution function, which is

$$F_{X_k}(x) = \sum_{i=1}^{5} \frac{i}{6}\, I_{[i,\,i+1)}(x) + I_{[6,\,\infty)}(x)$$

and is sketched in Fig. 3. ////
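
As a rough check on Example 4, the cumulative distribution functions can be computed by direct counting over the 36 equally likely outcomes. This is a sketch under that equal-likelihood assumption; the helper `cdf` and the function names `X1`, `X2`, `Y` are introduced here for illustration only.

```python
from fractions import Fraction
from itertools import product

# The 36 equally likely outcomes (i, j) of tossing two fair dice.
omega = list(product(range(1, 7), repeat=2))


def cdf(rv, x):
    """F(x) = P[rv <= x], computed by counting favorable outcomes."""
    favorable = sum(1 for w in omega if rv(w) <= x)
    return Fraction(favorable, len(omega))


def X1(w):
    """Value on the upturned face of the first die."""
    return w[0]


def X2(w):
    """Value on the upturned face of the second die."""
    return w[1]


def Y(w):
    """Absolute difference of the two upturned faces."""
    return abs(w[0] - w[1])


# X1 and X2 differ as functions on the sample space ...
differ_somewhere = any(X1(w) != X2(w) for w in omega)
# ... yet their cumulative distribution functions agree on a test grid.
grid = [k / 2 for k in range(-2, 16)]
same_cdf = all(cdf(X1, x) == cdf(X2, x) for x in grid)

print(differ_somewhere, same_cdf)
print([cdf(Y, y) for y in range(6)])
```

Agreement on a finite grid is only suggestive, of course; the exact statement is the formula for $F_{X_k}$ given in the example.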

Careful scrutiny of the definition and above examples might indicate the 
following properties of any cumulative distribution function $F_X(\cdot)$.

[FIGURE 2: sketch of the cumulative distribution function $F_Y(y)$]



Properties of a Cumulative Distribution Function $F_X(\cdot)$

(i) $F_X(-\infty) \equiv \lim_{x \to -\infty} F_X(x) = 0$, and $F_X(+\infty) \equiv \lim_{x \to +\infty} F_X(x) = 1$.

(ii) $F_X(\cdot)$ is a monotone, nondecreasing function; that is, $F_X(a) \le F_X(b)$
for a < b.

(iii) $F_X(\cdot)$ is continuous from the right; that is,

$$\lim_{0 < h \to 0} F_X(x + h) = F_X(x).$$

Except for (ii), we will not prove these properties. Note that the event
$\{\omega: X(\omega) \le b\} = \{X \le b\} = \{X \le a\} \cup \{a < X \le b\}$ and $\{X \le a\} \cap \{a < X \le b\} = \phi$;
hence $F_X(b) = P[X \le b] = P[X \le a] + P[a < X \le b] \ge P[X \le a] = F_X(a)$,
which proves (ii). Property (iii), the continuity of $F_X(\cdot)$ from the right, results
from our defining $F_X(x)$ to be $P[X \le x]$. If we had defined, as some authors do,
$F_X(x)$ to be $P[X < x]$, then $F_X(\cdot)$ would have been continuous from the left.
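
The properties can be spot-checked numerically for the fair-coin distribution function of Example 3. The sketch below is only a finite check on a grid with a small step `h` (both chosen arbitrarily here), so it illustrates the properties rather than proving them.

```python
def F(x):
    """Cumulative distribution function of the number of heads in one fair toss."""
    if x < 0:
        return 0.0
    if x < 1:
        return 0.5
    return 1.0


# Property (ii): nondecreasing, checked on consecutive grid points.
grid = [k / 10 for k in range(-20, 31)]
monotone = all(F(a) <= F(b) for a, b in zip(grid, grid[1:]))

# Property (iii): approaching a jump point from the right changes nothing,
# while approaching from the left reveals the jump.
h = 1e-9
right_continuous = all(F(x + h) == F(x) for x in (0, 1))
left_jumps = [F(x) - F(x - h) for x in (0, 1)]

print(monotone, right_continuous, left_jumps)
```

The nonzero left jumps at 0 and 1 are what would spoil right-continuity if $F_X(x)$ had been defined as $P[X < x]$ instead.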

Definition 3 Cumulative distribution function Any function $F(\cdot)$ with
domain the real line and counterdomain the interval [0, 1] satisfying the
above three properties is defined to be a cumulative distribution function.
////

This definition allows us to use the term “cumulative distribution func-
tion” without mentioning random variable.

After defining what is meant by continuous and discrete random variables
in the first two subsections of the next section, we will give another property
that cumulative distribution functions possess, the property of decomposition
into three parts.

[FIGURE 3: sketch of the cumulative distribution function $F_{X_k}(x)$; horizontal axis marked 0 to 6]


The cumulative distribution functions defined here are univariate; the 
introduction of bivariate and multivariate cumulative distribution functions 
will be deferred until Chap. IV. 


3 DENSITY FUNCTIONS 

Random variable and the cumulative distribution function of a random variable 
have been defined. The cumulative distribution function described the distri-
bution of values of the random variable. For two distinct classes of random
variables, the distribution of values can be described more simply by using
density functions. These two classes, distinguished by the words “discrete”
and “continuous,” are considered in the next two subsections.


3.1 Discrete Random Variables 

Definition 4 Discrete random variable A random variable X will be
defined to be discrete if the range of X is countable. If a random variable
X is discrete, then its corresponding cumulative distribution function
$F_X(\cdot)$ will be defined to be discrete. ////

By the range of X being countable we mean that there exists a finite or
denumerable set of real numbers, say $x_1, x_2, x_3, \ldots$, such that X takes on values
only in that set. If X is discrete with distinct values $x_1, x_2, \ldots, x_n, \ldots$, then
$\Omega = \bigcup_n \{\omega: X(\omega) = x_n\} = \bigcup_n \{X = x_n\}$, and $\{X = x_i\} \cap \{X = x_j\} = \phi$ for $i \ne j$;
hence $1 = P[\Omega] = \sum_n P[X = x_n]$ by the third axiom of probability.

Definition 5 Discrete density function of a discrete random variable If
X is a discrete random variable with distinct values $x_1, x_2, \ldots, x_n, \ldots$,
then the function, denoted by $f_X(\cdot)$ and defined by

$$f_X(x) = \begin{cases} P[X = x_j] & \text{if } x = x_j,\ j = 1, 2, \ldots, n, \ldots \\ 0 & \text{if } x \ne x_j \end{cases} \qquad (1)$$

is defined to be the discrete density function of X. ////


The values of a discrete random variable are often called mass points, and
$f_X(x_j)$ denotes the mass associated with the mass point $x_j$. Probability mass
function, discrete frequency function, and probability function are other terms
used in place of discrete density function. Also, the notation $p_X(\cdot)$ is some-
times used instead of $f_X(\cdot)$ for discrete density functions. $f_X(\cdot)$ is a function
with domain the real line and counterdomain the interval [0, 1]. If we use the
indicator function,

$$f_X(x) = \sum_{n=1}^{\infty} P[X = x_n]\, I_{\{x_n\}}(x),$$

where $I_{\{x_n\}}(x) = 1$ if $x = x_n$ and $I_{\{x_n\}}(x) = 0$ if $x \ne x_n$.

Theorem 1 Let X be a discrete random variable. $F_X(\cdot)$ can be obtained
from $f_X(\cdot)$, and vice versa.

PROOF Denote the mass points of X by $x_1, x_2, \ldots$. Suppose $f_X(\cdot)$
is given; then $F_X(x) = \sum_{\{j:\, x_j \le x\}} f_X(x_j)$. Conversely, suppose $F_X(\cdot)$ is given;
then $f_X(x_j) = F_X(x_j) - \lim_{0 < h \to 0} F_X(x_j - h)$; hence $f_X(x_j)$ can be found for
each mass point $x_j$; however, $f_X(x) = 0$ for $x \ne x_j$, $j = 1, 2, \ldots$, so $f_X(x)$ is
determined for all real numbers. ////


[FIGURE 4: sketch of a discrete density function $f_X(\cdot)$]

EXAMPLE 5 To illustrate what is meant in Theorem 1, consider the experi-
ment of tossing a single die. Let X denote the number of spots on the
upper face:

$$f_X(x) = \frac{1}{6}\, I_{\{1, 2, \ldots, 6\}}(x),$$

and

$$F_X(x) = \sum_{i=1}^{5} \frac{i}{6}\, I_{[i,\,i+1)}(x) + I_{[6,\,\infty)}(x).$$

According to Theorem 1, for given $f_X(\cdot)$, $F_X(x)$ can be found for any x;
for instance, if x = 2.5,

$$F_X(2.5) = \sum_{\{j:\, x_j \le 2.5\}} f_X(x_j) = f_X(1) + f_X(2) = \frac{2}{6}.$$

And, if $F_X(\cdot)$ is given, $f_X(x)$ can be found for any x. For example, for
x = 3,

$$f_X(3) = F_X(3) - \lim_{0 < h \to 0} F_X(3 - h) = \frac{3}{6} - \frac{2}{6} = \frac{1}{6}. \qquad ////$$
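
Both directions of Theorem 1 can be sketched for the single-die example. This is a minimal illustration, not the text's own procedure: the function names are hypothetical, and `pmf_from_cdf` assumes the mass points are finitely many and supplied in increasing order, so that the left-hand limit of F at a mass point equals the value of F at the previous mass point (or 0 before the first).

```python
from fractions import Fraction

# Discrete density of the number of spots on a fair die.
mass_points = [1, 2, 3, 4, 5, 6]
f = {x: Fraction(1, 6) for x in mass_points}


def cdf_from_pmf(pmf, x):
    """F(x): the sum of pmf[x_j] over the mass points x_j <= x."""
    return sum((p for xj, p in pmf.items() if xj <= x), Fraction(0))


def pmf_from_cdf(F, points):
    """Recover f(x_j) as the jump of F at each mass point.

    Assumes `points` lists every mass point in increasing order.
    """
    pmf = {}
    previous = Fraction(0)  # value of F just to the left of the first mass point
    for xj in points:
        pmf[xj] = F(xj) - previous
        previous = F(xj)
    return pmf


def F(x):
    """Cumulative distribution function built from f."""
    return cdf_from_pmf(f, x)


recovered = pmf_from_cdf(F, mass_points)
print(F(2.5))
print(recovered[3])
```

Using `Fraction` keeps the arithmetic exact, so the recovered density can be compared with the original by equality rather than by a tolerance.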


The cumulative distribution function of a discrete random variable has 
steps at the mass points; that is, at the mass point $x_j$, $F_X(\cdot)$ has a step of size
$f_X(x_j)$, and $F_X(\cdot)$ is flat between mass points.


EXAMPLE 6 Consider the experiment of tossing two dice. Let X denote
the total of the upturned faces. The mass points of X are 2, 3, ..., 12.
$f_X(\cdot)$ is sketched in Fig. 4. Let Y denote the absolute difference of
the upturned faces; then $f_Y(\cdot)$ is given in tabular form by

y          0      1       2      3      4      5
f_Y(y)     6/36   10/36   8/36   6/36   4/36   2/36

The discrete density function tells us how likely or probable each of the
values of a discrete random variable is. It also enables one to calculate the
probability of events described in terms of the discrete random variable X.
For example, let X have mass points $x_1, x_2, \ldots, x_n, \ldots$; then
$P[a < X \le b] = \sum_{\{j:\, a < x_j \le b\}} f_X(x_j)$ for a < b.
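
Interval probabilities of this kind reduce to summing the density over the mass points that fall in the interval. The sketch below does this for the total of two fair dice; it uses the half-open interval a < X ≤ b, and `prob_interval` is a name made up for the illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Discrete density of X = total of two fair dice, obtained by counting outcomes.
counts = Counter(i + j for i, j in product(range(1, 7), repeat=2))
f_X = {x: Fraction(c, 36) for x, c in counts.items()}


def prob_interval(pmf, a, b):
    """P[a < X <= b]: the sum of the masses at mass points x_j with a < x_j <= b."""
    return sum((p for x, p in pmf.items() if a < x <= b), Fraction(0))


print(prob_interval(f_X, 4, 7))   # masses at 5, 6, and 7
print(prob_interval(f_X, 1, 12))  # all the mass
```

Changing the comparison to `a < x < b` would give the probability of the open interval instead; the two differ exactly when b is itself a mass point.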

Definition 6 Discrete density function Any function $f(\cdot)$ with domain
the real line and counterdomain [0, 1] is defined to be a discrete density
function if for some countable set $x_1, x_2, \ldots, x_n, \ldots$,

(i) $f(x_j) > 0$ for j = 1, 2, ....

(ii) $f(x) = 0$ for $x \ne x_j$; j = 1, 2, ....

(iii) $\sum f(x_j) = 1$, where the summation is over the points $x_1, x_2, \ldots, x_n, \ldots$.
////


This definition allows us to speak of discrete density functions without
reference to some random variable. Hence we can talk about properties that
a density function might have without referring to a random variable.
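
Because the definition refers only to the function itself, it can be checked mechanically for a finite table of masses. The following sketch assumes the density is given as a dictionary of exact `Fraction` values with finitely many mass points; `is_discrete_density` is a hypothetical helper, and the table used is the one for the absolute difference of two dice.

```python
from fractions import Fraction


def is_discrete_density(table):
    """Check conditions (i) and (iii) for a finite table {x_j: f(x_j)}.

    Condition (ii), f(x) = 0 away from the listed points, holds by
    construction because the table lists every point with positive mass.
    """
    all_positive = all(p > 0 for p in table.values())
    sums_to_one = sum(table.values(), Fraction(0)) == 1
    return all_positive and sums_to_one


# Density of the absolute difference of two fair dice.
f_Y = {
    0: Fraction(6, 36),
    1: Fraction(10, 36),
    2: Fraction(8, 36),
    3: Fraction(6, 36),
    4: Fraction(4, 36),
    5: Fraction(2, 36),
}

print(is_discrete_density(f_Y))
print(is_discrete_density({0: Fraction(1, 2), 1: Fraction(1, 3)}))
```

A denumerably infinite set of mass points would need a different representation, for example a formula for the masses together with an argument that the series sums to 1.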

3.2 Continuous Random Variables

Definition 7 Continuous random variable A random variable X is
called continuous if there exists a function $f_X(\cdot)$ such that $F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$
for every real number x. The cumulative distribution function $F_X(\cdot)$ of a
continuous random variable X is called absolutely continuous. ////

Definition 8 Probability density function of a continuous random variable
If X is a continuous random variable, the function $f_X(\cdot)$ in
$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$ is called the probability density function of X. ////

Other names that are used instead of probability density function include
density function, continuous density function, and integrating density function.
Note that strictly speaking the probability density function $f_X(\cdot)$ of a
random variable X is not uniquely defined. All that the definition requires is
that the integral of $f_X(\cdot)$ gives $F_X(x)$ for every x, and more than one function
$f_X(\cdot)$ may satisfy such requirement. For example, suppose $F_X(x) = x I_{[0,1)}(x) +
I_{[1,\infty)}(x)$; then $f_X(u) = I_{(0,1)}(u)$ satisfies $F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$ for every x, and
so $f_X(\cdot)$ is a probability density function of X. However, $f_X(u) = I_{(0,1/2)}(u) +
69 I_{\{1/2\}}(u) + I_{(1/2,1)}(u)$ also satisfies $F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$. (The idea is that if the
value of a function is changed at only a “few” points, then its integral is
unchanged.) In practice a unique choice of $f_X(\cdot)$ is often dictated by continuity
considerations, and for this reason we will usually allow ourselves the liberty of
speaking of the probability density when in fact a probability density is more
correct.

One should point out that the word “continuous” in “continuous 
random variable” is not used in its usual sense. Although a random variable 
is a function and the notion of a continuous function is fairly well established 
in mathematics, “continuous” here is not used in that usual mathematical 
sense. In fact it is not clear in what sense it is used. Two possible
justifications do come to mind. In contrasting discrete random variables with
continuous random variables, one notes that a discrete random variable takes on a
finite or denumerable set of values whereas a continuous random variable takes
on a nondenumerable set of values. Possibly it is the connection between
“nondenumerable” and “continuum” that justifies use of the word
“continuous.” All the continuous random variables that we shall encounter will take
on a continuum of values. The second justification arises when one notes that 
the absolute continuity of the cumulative distribution function is the regular 
mathematical definition of an absolutely continuous function (in words, a 
function is called absolutely continuous if it can be written as the integral of its 
derivative); the “continuous,” then, in a corresponding continuous random 
variable could be considered just an abbreviation of “ absolutely continuous.” 


Theorem 2 Let X be a continuous random variable. Then $F_X(\cdot)$ can
be obtained from an $f_X(\cdot)$, and vice versa.

PROOF If X is a continuous random variable and an $f_X(\cdot)$ is given,
then $F_X(x)$ is obtained by integrating $f_X(\cdot)$; that is,
$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$. On the other hand, if $F_X(\cdot)$ is
given, then an $f_X(x)$ can be obtained by differentiation; that is,
$f_X(x) = dF_X(x)/dx$ for those points x for which $F_X(x)$ is
differentiable. ////

The notations for discrete density function and probability density function
are the same, yet they have quite different interpretations. For discrete
random variables $f_X(x) = P[X = x]$, which is not true for continuous random
variables. For continuous random variables,
$$f_X(x) = \frac{dF_X(x)}{dx} = \lim_{\Delta x \to 0} \frac{F_X(x + \Delta x) - F_X(x - \Delta x)}{2\,\Delta x};$$
hence $f_X(x)\,2\Delta x \approx F_X(x + \Delta x) - F_X(x - \Delta x) = P[x - \Delta x < X \le x + \Delta x]$; that
is, the probability that X is in a small interval containing the value x is
approximately equal to $f_X(x)$ times the width of the interval. For discrete
random variables $f_X(\cdot)$ is a function with domain the real line and
counterdomain the interval [0, 1], whereas, for continuous random variables
$f_X(\cdot)$ is a function with domain the real line and counterdomain the
infinite interval $[0, \infty)$.

Remark We will use the term “density function” without the modifier
of “discrete” or “probability” to represent either kind of density. ////


EXAMPLE 7 Let X be the random variable representing the length of a
telephone conversation. One could model this experiment by assuming
that the distribution of X is given by $F_X(x) = (1 - e^{-\lambda x}) I_{[0,\infty)}(x)$,
where $\lambda$ is some positive number. The corresponding probability density
function would be given by $f_X(x) = \lambda e^{-\lambda x} I_{[0,\infty)}(x)$. If we assume
that telephone conversations are measured in minutes,
$$P[5 < X \le 10] = \int_5^{10} \lambda e^{-\lambda x}\,dx = e^{-5\lambda} - e^{-10\lambda} = e^{-1} - e^{-2} \approx .23 \quad \text{for } \lambda = \tfrac{1}{5},$$
or
$$P[5 < X \le 10] = P[X \le 10] - P[X \le 5] = (1 - e^{-10\lambda}) - (1 - e^{-5\lambda}) = e^{-1} - e^{-2} \quad \text{for } \lambda = \tfrac{1}{5}.$$
////
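The interval probability in Example 7 can be checked numerically from the cumulative distribution function alone. This is a minimal sketch, assuming the exponential model of the example with rate 1/5 per minute; the function names are illustrative, not part of the text.

```python
import math

def exponential_cdf(x, lam):
    """F_X(x) = 1 - exp(-lam * x) for x >= 0, and 0 otherwise."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def interval_probability(a, b, lam):
    """P[a < X <= b] = F_X(b) - F_X(a)."""
    return exponential_cdf(b, lam) - exponential_cdf(a, lam)

lam = 1 / 5  # rate assumed in Example 7 (conversations measured in minutes)
p = interval_probability(5, 10, lam)
print(round(p, 4))  # should agree with e^{-1} - e^{-2}
```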

The probability density function is used to calculate the probability of
events defined in terms of the corresponding continuous random variable X.
For example, $P[a < X \le b] = \int_a^b f_X(x)\,dx$ for a < b.

Definition 9 Probability density function Any function $f(\cdot)$ with domain
the real line and counterdomain $[0, \infty)$ is defined to be a probability
density function if and only if

(i) $f(x) \ge 0$ for all x.

(ii) $\int_{-\infty}^{\infty} f(x)\,dx = 1$. ////


With this definition we can speak of probability density functions without
reference to random variables. We might note that a probability density
function of a continuous random variable as defined in Definition 8 does indeed
possess the two properties in the above definition.


3.3 Other Random Variables

Not all random variables are either continuous or discrete, or not all cumulative
distribution functions are either absolutely continuous or discrete.

EXAMPLE 8 Consider the experiment of recording the delay that a motorist
encounters at a one-way traffic stop sign. Let X be the random variable
that represents the delay that the motorist experiences after making the
required stop. There is a certain probability that there will be no opposing
traffic so that the motorist will be able to proceed with no delay. On
the other hand, if the motorist has to wait, he may have to wait for any of
a continuum of possible times. This experiment could be modeled by
assuming that X has a cumulative distribution function given by
$F_X(x) = (1 - pe^{-\lambda x}) I_{[0,\infty)}(x)$. This $F_X(x)$ has a jump of
1 - p at x = 0 but is continuous for x > 0. See Fig. 5. ////

Many practical examples of cumulative distribution functions that are 
partly discrete and partly absolutely continuous can be given. Yet there are 
still other types of cumulative distribution functions. There are continuous 
cumulative distribution functions, called singular continuous , whose derivative 
is 0 at almost all points. We will not consider such distribution functions other 
than to note the following result. 

Decomposition of a cumulative distribution function Any cumulative
distribution function F(x) may be represented in the form
$$F(x) = p_1 F^{d}(x) + p_2 F^{ac}(x) + p_3 F^{sc}(x), \quad \text{where } p_i \ge 0,\ i = 1, 2, 3, \tag{3}$$
$\sum_{i=1}^{3} p_i = 1$, and $F^{d}(\cdot)$, $F^{ac}(\cdot)$, and $F^{sc}(\cdot)$ are each
cumulative distribution functions with $F^{d}(\cdot)$ discrete, $F^{ac}(\cdot)$
absolutely continuous, and $F^{sc}(\cdot)$ singular continuous.

Cumulative distributions studied in this book will have at most a discrete
part and an absolutely continuous part; that is, the $p_3$ in Eq. (3) will always
be 0 for the $F(\cdot)$ that we will study.

EXAMPLE 9 To illustrate how the decomposition of a cumulative distribution
function can be implemented, consider $F_X(x) = (1 - pe^{-\lambda x}) I_{[0,\infty)}(x)$
as in Example 8. $F_X(x) = (1 - p)F^{d}(x) + pF^{ac}(x)$, where
$F^{d}(x) = I_{[0,\infty)}(x)$ and $F^{ac}(x) = (1 - e^{-\lambda x}) I_{[0,\infty)}(x)$. Note that
$$(1 - p)F^{d}(x) + pF^{ac}(x) = (1 - p) I_{[0,\infty)}(x) + p(1 - e^{-\lambda x}) I_{[0,\infty)}(x) = (1 - pe^{-\lambda x}) I_{[0,\infty)}(x) = F_X(x).$$
////
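The decomposition in Example 9 can be verified pointwise. The sketch below assumes the mixed cumulative distribution function of Example 8; the parameter values p = 0.7 and rate 2.0 are illustrative choices of ours and do not come from the text.

```python
import math

def F_mixed(x, p, lam):
    """F_X(x) = 1 - p*exp(-lam*x) for x >= 0, and 0 otherwise (Example 8)."""
    return 1.0 - p * math.exp(-lam * x) if x >= 0 else 0.0

def F_discrete(x):
    """Discrete part: all mass at 0, i.e. the indicator of [0, infinity)."""
    return 1.0 if x >= 0 else 0.0

def F_abs_cont(x, lam):
    """Absolutely continuous part: the exponential distribution function."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

p, lam = 0.7, 2.0  # illustrative values only
grid = [-1.0, 0.0, 0.1, 0.5, 1.0, 3.0]

# Largest discrepancy between F_X and (1 - p) F^d + p F^ac on the grid.
max_gap = max(
    abs(F_mixed(x, p, lam) - ((1 - p) * F_discrete(x) + p * F_abs_cont(x, lam)))
    for x in grid
)
# Size of the jump at x = 0, which should equal the discrete weight 1 - p.
jump_at_zero = F_mixed(0.0, p, lam) - F_mixed(-1e-12, p, lam)
print(max_gap, jump_at_zero)
```

The jump at the origin recovers the weight 1 - p of the discrete part, which is how the two weights would be read off in practice.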

A density function corresponding to a cumulative distribution that is
partly discrete and partly absolutely continuous could be defined as follows:
If $F(x) = (1 - p)F^{d}(x) + pF^{ac}(x)$, where 0 < p < 1 and $F^{d}(\cdot)$ and
$F^{ac}(\cdot)$ are, respectively, discrete and absolutely continuous cumulative
distribution functions, let the density function f(x) corresponding to F(x) be
defined by $f(x) = (1 - p)f^{d}(x) + pf^{ac}(x)$, where $f^{d}(\cdot)$ is the discrete
density function corresponding to $F^{d}(\cdot)$ and $f^{ac}(\cdot)$ is the probability
density function corresponding to $F^{ac}(\cdot)$. Such a density function would
require careful interpretation; so when considering cumulative distribution
functions that are partly discrete and partly continuous, we will tend to work
with the cumulative distribution function itself rather than with a density
function.

Remark In future chapters we will frequently have to state that a
random variable has a certain distribution. We will make such a statement
by giving either the cumulative distribution function or the density
function of the random variable of interest. ////


4 EXPECTATIONS AND MOMENTS 

An extremely useful concept in problems involving random variables or
distributions is that of expectation. The subsections of this section give
definitions and results regarding expectations.


4.1 Mean 

Definition 10 Mean Let X be a random variable. The mean of X,
denoted by $\mu_X$ or $\mathscr{E}[X]$, is defined by:

(i) $$\mathscr{E}[X] = \sum_j x_j f_X(x_j) \tag{4}$$
if X is discrete with mass points $x_1, x_2, \ldots, x_j, \ldots$.

(ii) $$\mathscr{E}[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx \tag{5}$$
if X is continuous with probability density function $f_X(x)$.

(iii) $$\mathscr{E}[X] = \int_0^{\infty} [1 - F_X(x)]\,dx - \int_{-\infty}^{0} F_X(x)\,dx \tag{6}$$
for an arbitrary random variable X. ////


In (i), $\mathscr{E}[X]$ is defined to be the indicated series provided that the
series is absolutely convergent; otherwise, we say that the mean does not exist.
And in (ii), $\mathscr{E}[X]$ is defined to be the indicated integral if the integral
exists; otherwise, we say that the mean does not exist. Finally, in (iii), we
require that both integrals be finite for the existence of $\mathscr{E}[X]$.

Note what the definition says: In $\sum_j x_j f_X(x_j)$, the summand is the jth value
of the random variable X multiplied by the probability that X equals that jth
value, and then the summation is over all values. So $\mathscr{E}[X]$ is an “average” of the
values that the random variable takes on, where each value is weighted by the
probability that the random variable is equal to that value. Values that are
more probable receive more weight. The same is true in integral form in (ii).
There the value x is multiplied by the approximate probability that X equals
the value x, namely $f_X(x)\,dx$, and then integrated over all values.

Several remarks are in order. 


Remark In the definition of a mean of a random variable, only density 
functions [in (i) and (ii)] or distribution functions [in (iii)] were used; 
hence we have really defined the mean for these functions without reference 
to random variables. We then call the defined mean the mean of the 
cumulative distribution function or of the appropriate density function. 
Hence, we can and will speak of the mean of a distribution or density 
function as well as the mean of a random variable. //// 


Remark $\mathscr{E}[X]$ is the center of gravity (or centroid) of the unit mass that
is determined by the density function of X. So the mean of X is a measure
of where the values of the random variable X are “centered.” Other
measures of “location” or “center” of a random variable or its corresponding
density are given in Subsec. 4.6. ////





Remark (iii) of the definition is for all random variables, whereas
(i) is for discrete random variables, and (ii) is for continuous random
variables. Of course, $\mathscr{E}[X]$ could have been defined by just giving (iii).
The reason for including (i) and (ii) is that they are more intuitive for
their respective cases. It can be proved, although we will not do it, that
(i) follows from (iii) in the case of discrete random variables and (ii) follows
from (iii) in the case of continuous random variables. Our main use
of (iii) will be in finding the mean of a random variable X that is neither
discrete nor continuous. See Example 12 below. ////

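EXAMPLE 10 Consider the experiment of tossing two dice. Let X denote
the total of the two dice and Y their absolute difference. The discrete
density functions for X and Y are given in Example 6.
$$\mathscr{E}[Y] = \sum_{i=0}^{5} i f_Y(i) = 0 \cdot \tfrac{6}{36} + 1 \cdot \tfrac{10}{36} + 2 \cdot \tfrac{8}{36} + 3 \cdot \tfrac{6}{36} + 4 \cdot \tfrac{4}{36} + 5 \cdot \tfrac{2}{36} = \tfrac{70}{36}.$$
$$\mathscr{E}[X] = \sum_{i=2}^{12} i f_X(i) = 7.$$
Note that $\mathscr{E}[Y]$ is not one of the possible values of Y. ////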
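The two means in Example 10 follow directly from part (i) of Definition 10. The sketch below is a hedged illustration assuming fair dice; `density` and `mean` are helper names of ours.

```python
from fractions import Fraction
from itertools import product

def density(statistic):
    """Discrete density of statistic(a, b) over the 36 equally likely tosses."""
    f = {}
    for a, b in product(range(1, 7), repeat=2):
        value = statistic(a, b)
        f[value] = f.get(value, 0) + Fraction(1, 36)
    return f

def mean(f):
    """Eq. (4): sum of each mass point times its probability."""
    return sum(x * prob for x, prob in f.items())

mean_X = mean(density(lambda a, b: a + b))       # total of the two dice
mean_Y = mean(density(lambda a, b: abs(a - b)))  # absolute difference
print(mean_X, mean_Y)
```

Because `Fraction` reduces automatically, the mean of Y is reported in lowest terms rather than over 36.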

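EXAMPLE 11 Let X be a continuous random variable with probability
density function $f_X(x) = \lambda e^{-\lambda x} I_{[0,\infty)}(x)$.
$$\mathscr{E}[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_0^{\infty} x \lambda e^{-\lambda x}\,dx = \frac{1}{\lambda}.$$
The corresponding cumulative distribution function is
$F_X(x) = (1 - e^{-\lambda x}) I_{[0,\infty)}(x)$; so
$$\mathscr{E}[X] = \int_0^{\infty} [1 - F_X(x)]\,dx - \int_{-\infty}^{0} F_X(x)\,dx = \int_0^{\infty} (1 - 1 + e^{-\lambda x})\,dx = \frac{1}{\lambda}.$$
////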

EXAMPLE 12 Let X be a random variable with cumulative distribution
function given by $F_X(x) = (1 - pe^{-\lambda x}) I_{[0,\infty)}(x)$; then
$$\mathscr{E}[X] = \int_0^{\infty} [1 - F_X(x)]\,dx - \int_{-\infty}^{0} F_X(x)\,dx = \int_0^{\infty} pe^{-\lambda x}\,dx = \frac{p}{\lambda}.$$
Here, we have used Eq. (6) to find the mean of a random variable that is
partly discrete and partly continuous. ////

EXAMPLE 13 Let X be a random variable with probability density function
given by $f_X(x) = x^{-2} I_{[1,\infty)}(x)$; then
$$\mathscr{E}[X] = \int_1^{\infty} x \frac{dx}{x^2} = \lim_{b \to \infty} \log_e b = \infty,$$
so we say that $\mathscr{E}[X]$ does not exist. We might also say that the mean of X
is infinite since it is clear here that the integral that defines the mean is
infinite. ////
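Equation (6) lends itself to a numerical check on the partly discrete, partly continuous distribution of Example 12, whose mean works out to $p/\lambda$. The sketch below assumes that distribution; the parameter values, the truncation point, and the midpoint rule are illustrative choices of ours, not a method from the text.

```python
import math

def F(x, p, lam):
    """F_X(x) = 1 - p*exp(-lam*x) for x >= 0, and 0 otherwise (Example 12)."""
    return 1.0 - p * math.exp(-lam * x) if x >= 0 else 0.0

def midpoint_integral(func, lower, upper, steps=100_000):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    h = (upper - lower) / steps
    return h * sum(func(lower + (k + 0.5) * h) for k in range(steps))

p, lam = 0.7, 2.0       # illustrative values only
upper = 40.0 / lam      # beyond this point 1 - F(x) is below p*exp(-40)

# Eq. (6): integral of [1 - F] over (0, inf) minus integral of F over (-inf, 0).
positive_part = midpoint_integral(lambda x: 1.0 - F(x, p, lam), 0.0, upper)
negative_part = midpoint_integral(lambda x: F(x, p, lam), -upper, 0.0)
mean_via_eq6 = positive_part - negative_part
print(mean_via_eq6, p / lam)
```

The second integral vanishes here because the distribution places no mass on the negative axis.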

4.2 Variance 

The mean of a random variable X , defined in the previous subsection, was a 
measure of central location of the density of X. The variance of a random vari- 
able X will be a measure of the spread or dispersion of the density of X . 

Definition 11 Variance Let X be a random variable, and let $\mu_X$ be
$\mathscr{E}[X]$. The variance of X, denoted by $\sigma_X^2$ or var [X], is defined by

(i) $$\text{var}[X] = \sum_j (x_j - \mu_X)^2 f_X(x_j) \tag{7}$$
if X is discrete with mass points $x_1, x_2, \ldots, x_j, \ldots$.

(ii) $$\text{var}[X] = \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x)\,dx \tag{8}$$
if X is continuous with probability density function $f_X(x)$.

(iii) $$\text{var}[X] = \int_0^{\infty} 2x[1 - F_X(x) + F_X(-x)]\,dx - \mu_X^2 \tag{9}$$
for an arbitrary random variable X. ////

The variances are defined only if the series in (i) is convergent or if the 
integrals in (ii) and (iii) exist. Again, the variance of a random variable is 
defined in terms of the density function or cumulative distribution function of 
the random variable; hence variance could be defined in terms of these functions 
without reference to a random variable. 

Note what the definition says: In (i), the square of the difference between 
the jth value of the random variable X and the mean of X is multiplied by the 
probability that X equals the jth value, and then these terms are summed.
More weight is assigned to the more probable squared differences. A similar 
comment applies for (ii). Variance is a measure of spread since if the values 
of a random variable X tend to be far from their mean, the variance of X will 
be larger than the variance of a comparable random variable Y whose values 
tend to be near their mean. It is clear from (i) and (ii) and true for (iii) that 
variance is nonnegative. We saw that a mean was the center of gravity of a
density; similarly (for those readers familiar with elementary physics or
mechanics), variance represents the moment of inertia of the same density with
respect to a perpendicular axis through the center of gravity.

Definition 12 Standard deviation If X is a random variable, the
standard deviation of X, denoted by $\sigma_X$, is defined as $+\sqrt{\text{var}[X]}$. ////

The standard deviation of a random variable, like the variance, is a measure
of the spread or dispersion of the values of the random variable. In many
applications it is preferable to the variance as such a measure since it will have
the same measurement units as the random variable itself.

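EXAMPLE 14 Let X be the total of the two dice in the experiment of tossing
two dice.
$$\begin{aligned}
\text{var}[X] &= \sum_j (x_j - \mu_X)^2 f_X(x_j) \\
&= (2 - 7)^2 \tfrac{1}{36} + (3 - 7)^2 \tfrac{2}{36} + (4 - 7)^2 \tfrac{3}{36} + (5 - 7)^2 \tfrac{4}{36} \\
&\quad + (6 - 7)^2 \tfrac{5}{36} + (7 - 7)^2 \tfrac{6}{36} + (8 - 7)^2 \tfrac{5}{36} + (9 - 7)^2 \tfrac{4}{36} \\
&\quad + (10 - 7)^2 \tfrac{3}{36} + (11 - 7)^2 \tfrac{2}{36} + (12 - 7)^2 \tfrac{1}{36} = \tfrac{210}{36}.
\end{aligned}$$
////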
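The sum in Example 14 is easy to evaluate exactly. The following minimal sketch assumes fair dice, for which the density of the total can be written as $f_X(x) = (6 - |x - 7|)/36$ for x = 2, ..., 12; that closed form is a convenience of ours rather than notation from the text.

```python
from fractions import Fraction
import math

# Discrete density of the total of two fair dice: f_X(x) = (6 - |x - 7|)/36.
f_X = {x: Fraction(6 - abs(x - 7), 36) for x in range(2, 13)}

# Mean by Eq. (4) and variance by Eq. (7), both in exact arithmetic.
mu = sum(x * prob for x, prob in f_X.items())
variance = sum((x - mu) ** 2 * prob for x, prob in f_X.items())

# Standard deviation (Definition 12): the positive square root of the variance.
std_dev = math.sqrt(variance)
print(mu, variance, std_dev)
```

The standard deviation is in the same units as the total itself (spots), which is the reason given in the text for often preferring it to the variance.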

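EXAMPLE 15 Let X be a random variable with probability density given by
$f_X(x) = \lambda e^{-\lambda x} I_{[0,\infty)}(x)$; then
$$\text{var}[X] = \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x)\,dx = \int_0^{\infty} \left(x - \frac{1}{\lambda}\right)^2 \lambda e^{-\lambda x}\,dx = \frac{1}{\lambda^2}.$$
////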


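EXAMPLE 16 Let X be a random variable with cumulative distribution
given by $F_X(x) = (1 - pe^{-\lambda x}) I_{[0,\infty)}(x)$; then
$$\text{var}[X] = \int_0^{\infty} 2x[1 - F_X(x) + F_X(-x)]\,dx - \mu_X^2 = \int_0^{\infty} 2xpe^{-\lambda x}\,dx - \left(\frac{p}{\lambda}\right)^2 = \frac{2p}{\lambda^2} - \frac{p^2}{\lambda^2} = \frac{p(2 - p)}{\lambda^2}.$$
////

4.3 Expected Value of a Function of a Random Variable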

We defined the expectation of an arbitrary random variable X, called the mean 
of X, in Subsec. 4.1. In this subsection, we will define the expectation of a 
function of a random variable for discrete or continuous random variables. 



Definition 13 Expectation Let X be a random variable and $g(\cdot)$ be
a function with both domain and counterdomain the real line. The
expectation or expected value of the function $g(\cdot)$ of the random variable
X, denoted by $\mathscr{E}[g(X)]$, is defined by:

(i) $$\mathscr{E}[g(X)] = \sum_j g(x_j) f_X(x_j) \tag{10}$$
if X is discrete with mass points $x_1, x_2, \ldots, x_j, \ldots$ (provided this
series is absolutely convergent).

(ii) $$\mathscr{E}[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx \tag{11}$$
if X is continuous with probability density function $f_X(x)$ (provided
$\int_{-\infty}^{\infty} |g(x)| f_X(x)\,dx < \infty$).* ////

Expectation or expected value is not really a very good name since it is 
not necessarily what you “expect.” For example, the expected value of a 
discrete random variable is not necessarily one of the possible values of the 
discrete random variable, in which case, you would not “expect” to get the 
expected value. A better name might be “average value” rather than “expected
value.”

Since $\mathscr{E}[g(X)]$ is defined in terms of the density function of X, it could be
defined without reference to a random variable.

Remark If g(x) = x, then $\mathscr{E}[g(X)] = \mathscr{E}[X]$ is the mean of X. If
$g(x) = (x - \mu_X)^2$, then $\mathscr{E}[g(X)] = \mathscr{E}[(X - \mu_X)^2] = \text{var}[X]$. ////

* $\mathscr{E}[g(X)]$ has been defined here for random variables that are either discrete or
continuous; it can be defined for other random variables as well. For the reader
who is familiar with the Stieltjes integral, $\mathscr{E}[g(X)]$ is defined as the Stieltjes integral
$\int_{-\infty}^{\infty} g(x)\,dF_X(x)$ (provided this integral exists), where $F_X(\cdot)$ is the cumulative
distribution function of X. If X is a random variable whose cumulative distribution function is
partly discrete and partly continuous, then (according to Subsec. 3.3)
$F_X(x) = (1 - p)F^{d}(x) + pF^{ac}(x)$ for some 0 < p < 1. Now $\mathscr{E}[g(X)]$ can be defined to be
$\mathscr{E}[g(X)] = (1 - p)\sum_j g(x_j) f^{d}(x_j) + p\int_{-\infty}^{\infty} g(x) f^{ac}(x)\,dx$, where $f^{d}(\cdot)$ is the
discrete density function corresponding to $F^{d}(\cdot)$ and $f^{ac}(\cdot)$ is the probability density
function corresponding to $F^{ac}(\cdot)$.

Theorem 3 Below are properties of expected value:

(i) $\mathscr{E}[c] = c$ for a constant c.

(ii) $\mathscr{E}[cg(X)] = c\,\mathscr{E}[g(X)]$ for a constant c.

(iii) $\mathscr{E}[c_1 g_1(X) + c_2 g_2(X)] = c_1 \mathscr{E}[g_1(X)] + c_2 \mathscr{E}[g_2(X)]$.

(iv) $\mathscr{E}[g_1(X)] \le \mathscr{E}[g_2(X)]$ if $g_1(x) \le g_2(x)$ for all x.

PROOF Assume X is continuous. To prove (i), take g(x) = c; then
$$\mathscr{E}[g(X)] = \mathscr{E}[c] = \int_{-\infty}^{\infty} c f_X(x)\,dx = c\int_{-\infty}^{\infty} f_X(x)\,dx = c.$$
$$\mathscr{E}[cg(X)] = \int_{-\infty}^{\infty} cg(x) f_X(x)\,dx = c\int_{-\infty}^{\infty} g(x) f_X(x)\,dx = c\,\mathscr{E}[g(X)],$$
which proves (ii). (iii) is given by
$$\begin{aligned}
\mathscr{E}[c_1 g_1(X) + c_2 g_2(X)] &= \int_{-\infty}^{\infty} [c_1 g_1(x) + c_2 g_2(x)] f_X(x)\,dx \\
&= c_1 \int_{-\infty}^{\infty} g_1(x) f_X(x)\,dx + c_2 \int_{-\infty}^{\infty} g_2(x) f_X(x)\,dx \\
&= c_1 \mathscr{E}[g_1(X)] + c_2 \mathscr{E}[g_2(X)].
\end{aligned}$$
Finally,
$$0 \le \mathscr{E}[g_2(X) - g_1(X)] = \mathscr{E}[g_2(X)] - \mathscr{E}[g_1(X)],$$
which gives (iv).

Similar proofs could be presented for the discrete random variable
case. ////

Theorem 4 If X is a random variable, $\text{var}[X] = \mathscr{E}[(X - \mathscr{E}[X])^2] =
\mathscr{E}[X^2] - (\mathscr{E}[X])^2$ provided $\mathscr{E}[X^2]$ exists.

PROOF (We first note that if $\mathscr{E}[X^2]$ exists, then $\mathscr{E}[X]$ exists.)* By
our definitions of variance and $\mathscr{E}[g(X)]$, it follows that
$\text{var}[X] = \mathscr{E}[(X - \mathscr{E}[X])^2]$. Now
$$\mathscr{E}[(X - \mathscr{E}[X])^2] = \mathscr{E}[X^2 - 2X\mathscr{E}[X] + (\mathscr{E}[X])^2] = \mathscr{E}[X^2] - 2(\mathscr{E}[X])^2 + (\mathscr{E}[X])^2 = \mathscr{E}[X^2] - (\mathscr{E}[X])^2.$$
////

The above theorem provides us with two methods of calculating a variance,
namely $\mathscr{E}[(X - \mu_X)^2]$ or $\mathscr{E}[X^2] - \mu_X^2$. Note that both methods require $\mu_X$.

* Here and in the future we are not going to concern ourselves with checking existence.

$\mathscr{E}[g(X)]$ is used in each of the following three subsections. In Subsec. 4.4
and 4.5 two inequalities involving $\mathscr{E}[g(X)]$ are given. Definitions and examples
of $\mathscr{E}[g(X)]$ for particular functions $g(\cdot)$ are given in Subsec. 4.6.

4.4 Chebyshev Inequality 

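Theorem 5 Let X be a random variable and $g(\cdot)$ a nonnegative function
with domain the real line; then
$$P[g(X) \ge k] \le \frac{\mathscr{E}[g(X)]}{k} \quad \text{for every } k > 0. \tag{12}$$

PROOF Assume that X is a continuous random variable with
probability density function $f_X(\cdot)$; then
$$\begin{aligned}
\mathscr{E}[g(X)] &= \int_{-\infty}^{\infty} g(x) f_X(x)\,dx = \int_{\{x:\,g(x) \ge k\}} g(x) f_X(x)\,dx + \int_{\{x:\,g(x) < k\}} g(x) f_X(x)\,dx \\
&\ge \int_{\{x:\,g(x) \ge k\}} g(x) f_X(x)\,dx \ge \int_{\{x:\,g(x) \ge k\}} k f_X(x)\,dx = kP[g(X) \ge k].
\end{aligned}$$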





that is, the probability that X falls within $r\sigma_X$ units of $\mu_X$ is greater than or
equal to $1 - 1/r^2$. For r = 2 one gets $P[\mu_X - 2\sigma_X < X < \mu_X + 2\sigma_X] \ge \frac{3}{4}$, or
for any random variable X having finite variance at least three-fourths of the
mass of X falls within two standard deviations of its mean.

Ordinarily, to calculate the probability of an event described in terms of
a random variable X, the distribution or density of X is needed; the Chebyshev
inequality gives a bound, which does not depend on the distribution of X, for the
probability of particular events described in terms of a random variable and
its mean and variance.
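The bound $1 - 1/r^2$ can be compared with exact probabilities for a specific distribution. The sketch below uses the total of two fair dice as a test case; that choice, and the helper `prob_within`, are ours and serve only as an illustration.

```python
from fractions import Fraction

# Discrete density of the total of two fair dice: f_X(x) = (6 - |x - 7|)/36.
f_X = {x: Fraction(6 - abs(x - 7), 36) for x in range(2, 13)}
mu = sum(x * p for x, p in f_X.items())
var = sum((x - mu) ** 2 * p for x, p in f_X.items())

def prob_within(r):
    """P[|X - mu| < r*sigma], evaluated exactly as P[(X - mu)^2 < r^2 * var]."""
    return sum(p for x, p in f_X.items() if (x - mu) ** 2 < r * r * var)

for r in (Fraction(3, 2), 2, 3):
    bound = 1 - Fraction(1, 1) / (r * r)   # Chebyshev lower bound 1 - 1/r^2
    print(r, prob_within(r), bound, prob_within(r) >= bound)
```

For this distribution the exact probabilities sit well above the bound, which is expected: the inequality holds for every distribution with finite variance and so cannot be tight for most of them.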


4.5 Jensen Inequality

Definition 14 Convex function A continuous function $g(\cdot)$ with domain
and counterdomain the real line is called convex if for every $x_0$ on the
real line, there exists a line which goes through the point $(x_0, g(x_0))$ and
lies on or under the graph of the function $g(\cdot)$. ////

Theorem 6 Jensen inequality Let X be a random variable with mean
$\mathscr{E}[X]$, and let $g(\cdot)$ be a convex function; then $\mathscr{E}[g(X)] \ge g(\mathscr{E}[X])$.

PROOF Since g(x) is continuous and convex, there exists a line, say
$l(x) = a + bx$, satisfying $l(x) = a + bx \le g(x)$ and $l(\mathscr{E}[X]) = g(\mathscr{E}[X])$.
l(x) is a line given by the definition of continuous and convex that goes
through the point $(\mathscr{E}[X], g(\mathscr{E}[X]))$. Note that $\mathscr{E}[l(X)] = \mathscr{E}[(a + bX)] =
a + b\mathscr{E}[X] = l(\mathscr{E}[X])$; hence $g(\mathscr{E}[X]) = l(\mathscr{E}[X]) = \mathscr{E}[l(X)] \le \mathscr{E}[g(X)]$ [using
property (iv) of expected values (see Theorem 3) for the last inequality]. ////

The Jensen inequality can be used to prove the Rao-Blackwell theorem to
appear in Chap. VII. We point out that in general $\mathscr{E}[g(X)] \ne g(\mathscr{E}[X])$; for
example, note that $g(x) = x^2$ is convex; hence $\mathscr{E}[X^2] \ge (\mathscr{E}[X])^2$, which says
that the variance of X, which is $\mathscr{E}[X^2] - (\mathscr{E}[X])^2$, is nonnegative.
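The Jensen inequality can be observed numerically for a few convex functions. The sketch below again takes the total of two fair dice as the random variable; the three functions are illustrative choices of ours.

```python
import math

# Discrete density of the total of two fair dice: f_X(x) = (6 - |x - 7|)/36.
f_X = {x: (6 - abs(x - 7)) / 36 for x in range(2, 13)}

def expectation(g):
    """E[g(X)] = sum of g(x_j) * f_X(x_j) over the mass points, as in Eq. (10)."""
    return sum(g(x) * prob for x, prob in f_X.items())

mean = expectation(lambda x: x)

convex_functions = {
    "x^2": lambda x: x * x,
    "exp(x/4)": lambda x: math.exp(x / 4),
    "|x - 5|": lambda x: abs(x - 5),
}

# Jensen gap E[g(X)] - g(E[X]); it should be nonnegative for every convex g.
gaps = {name: expectation(g) - g(mean) for name, g in convex_functions.items()}
for name, gap in gaps.items():
    print(name, gap)
```

For $g(x) = x^2$ the gap is exactly the variance of X, which is the observation made in the paragraph above.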




4.6 Moments and Moment Generating Functions

The moments (or raw moments) of a random variable or of a distribution are
the expectations of the powers of the random variable which has the given
distribution.

Definition 15 Moments If X is a random variable, the rth moment of
X, usually denoted by $\mu'_r$, is defined as
$$\mu'_r = \mathscr{E}[X^r] \tag{15}$$
if the expectation exists. ////

Note that $\mu'_1 = \mathscr{E}[X] = \mu_X$, the mean of X.


Definition 16 Central moments If X is a random variable, the rth
central moment of X about a is defined as $\mathscr{E}[(X - a)^r]$. If $a = \mu_X$, we
have the rth central moment of X about $\mu_X$, denoted by $\mu_r$, which is
$$\mu_r = \mathscr{E}[(X - \mu_X)^r]. \tag{16}$$
////

Note that $\mu_1 = \mathscr{E}[(X - \mu_X)] = 0$ and $\mu_2 = \mathscr{E}[(X - \mu_X)^2]$, the variance of X.
Also, note that all odd moments of X about $\mu_X$ are 0 if the density function of X
is symmetrical about $\mu_X$, provided such moments exist.

In the ensuing few paragraphs we will comment on how the first four 
moments of a random variable or density are used as measures of various 
characteristics of the corresponding density. For some of these characteristics, 
other measures can be defined in terms of quantiles . 




Definition 17 Quantile The qth quantile of a random variable X or of
its corresponding distribution is denoted by $\xi_q$ and is defined as the
smallest number $\xi$ satisfying $F_X(\xi) \ge q$. ////

If X is a continuous random variable, then the qth quantile of X is given as
the smallest number $\xi$ satisfying $F_X(\xi) = q$. See Fig. 6.

Definition 18 Median The median of a random variable X, denoted by
$\text{med}_X$, med (X), or $\xi_{.50}$, is the .5th quantile. ////

Remark In some texts the median of X is alternatively defined as any
number, say med (X), satisfying $P[X \le \text{med}(X)] \ge \frac{1}{2}$ and $P[X \ge \text{med}(X)] \ge \frac{1}{2}$. ////


If X is a continuous random variable, then the median of X satisfies
$$\int_{-\infty}^{\text{med}(X)} f_X(x)\,dx = \frac{1}{2} = \int_{\text{med}(X)}^{\infty} f_X(x)\,dx;$$
so the median of X is any number that has half the mass of X to its right and
the other half to its left, which justifies use of the word “median.”
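For a discrete random variable, Definition 17 amounts to accumulating the density until the cumulative probability first reaches q. The sketch below applies this to the absolute difference Y of two fair dice from Example 6; the function name `quantile` and the choice of example are ours.

```python
from fractions import Fraction

# Discrete density of Y, the absolute difference of two fair dice (Example 6).
f_Y = {0: Fraction(6, 36), 1: Fraction(10, 36), 2: Fraction(8, 36),
       3: Fraction(6, 36), 4: Fraction(4, 36), 5: Fraction(2, 36)}

def quantile(density, q):
    """Smallest mass point xi with F(xi) >= q, for 0 < q <= 1."""
    cumulative = Fraction(0)
    for x in sorted(density):
        cumulative += density[x]   # cumulative equals F(x) at this mass point
        if cumulative >= q:
            return x
    raise ValueError("q must satisfy 0 < q <= 1")

median = quantile(f_Y, Fraction(1, 2))
lower_quartile = quantile(f_Y, Fraction(1, 4))
upper_quartile = quantile(f_Y, Fraction(3, 4))
print(median, lower_quartile, upper_quartile)
```

Since the cumulative distribution function of a discrete random variable is a step function, the smallest number satisfying the inequality is always one of the mass points, which is why the search can be restricted to them.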

We have already mentioned that $\mathscr{E}[X]$, the first moment, locates the “center”
of the density of X. The median of X is also used to indicate a central
location of the density of X. A third measure of location of the density of X,
though not necessarily a measure of central location, is the mode of X, which is
defined as that point (if such a point exists) at which $f_X(\cdot)$ attains its maximum.
Other measures of location [for example, $\frac{1}{2}(\xi_{.25} + \xi_{.75})$] could be devised, but
three, mean, median, and mode, are the ones commonly used.

We previously mentioned that the second moment about the mean, the
variance of a distribution, measures the spread or dispersion of a distribution.
Let us look a little further into the manner in which the variance characterizes
the distribution. Suppose that $f_1(x)$ and $f_2(x)$ are two densities with the same
mean $\mu$ such that
$$\int_{\mu - a}^{\mu + a} [f_1(x) - f_2(x)]\,dx \ge 0 \tag{17}$$
for every value of a. Two such densities are illustrated in Fig. 7. It can be
shown that in this case the variance $\sigma_1^2$ in the first density is smaller than the
variance $\sigma_2^2$ in the second density. We shall not take the time to prove this in
detail, but the argument is roughly this: Let
$$g(x) = f_1(x) - f_2(x),$$
where $f_1(x)$ and $f_2(x)$ satisfy Eq. (17). Since $\int_{-\infty}^{\infty} g(x)\,dx = 0$, the positive area
between g(x) and the x axis is equal to the negative area. Furthermore, in
view of Eq. (17), every positive element of area $g(x')\,dx'$ may be balanced by a
negative element $g(x'')\,dx''$ in such a way that $x''$ is further from $\mu$ than $x'$.
When these elements of area are multiplied by $(x - \mu)^2$, the negative elements
will be multiplied by larger factors than their corresponding positive elements
(see Fig. 8); hence
$$\int_{-\infty}^{\infty} (x - \mu)^2 g(x)\,dx < 0$$
unless $f_1(x)$ and $f_2(x)$ are equal. Thus it follows that $\sigma_1^2 < \sigma_2^2$. The converse
of these statements is not true. That is, if one is told that $\sigma_1^2 < \sigma_2^2$, he cannot
conclude that the corresponding densities satisfy Eq. (17) for all values of a,
although it can be shown that Eq. (17) must be true for certain values of a.
Thus the condition $\sigma_1^2 < \sigma_2^2$ does not give one any precise information about
the nature of the corresponding distributions, but it is evident that $f_1(x)$ has
more area near the mean than $f_2(x)$, at least for certain intervals about the mean.

We indicated above how variance is used as a measure of spread or
dispersion of a distribution. Alternative measures of dispersion can be defined
in terms of quantiles. For example, $\xi_{.75} - \xi_{.25}$, called the interquartile range,
is a measure of spread. Also, $\xi_p - \xi_{1-p}$ for some $\frac{1}{2} < p < 1$ is a possible
measure of spread.

The third moment $\mu_3$ about the mean is sometimes called a measure of
asymmetry, or skewness. Symmetrical distributions like those in Fig. 9 can be
shown to have $\mu_3 = 0$. A curve shaped like $f_1(x)$ in Fig. 10 is said to be skewed
to the left and can be shown to have a negative third moment about the mean;
one shaped like $f_2(x)$ is called skewed to the right and can be shown to have a
positive third moment about the mean. Actually, however, knowledge of the
third moment gives almost no clue as to the shape of the distribution, and we
mention it mainly to point out that fact. Thus, for example, the density $f_3(x)$
in Fig. 10 has $\mu_3 = 0$, but it is far from symmetrical. By changing the curve
slightly we could give it either a positive or negative third moment. The ratio
$\mu_3/\sigma^3$, which is unitless, is called the coefficient of skewness.

FIGURE 9

The quantity a = (mean - median)/(standard deviation) provides an
alternative measure of skewness. It can be proved that $-1 \le a \le 1$.

The fourth moment about the mean is sometimes used as a measure of
excess or kurtosis, which is the degree of flatness of a density near its center.
Positive values of $\mu_4/\sigma^4 - 3$, called the coefficient of excess or kurtosis, are
sometimes used to indicate that a density is more peaked around its center than
the density of a normal curve (see Subsec. 3.2 of Chap. III), and negative values
are sometimes used to indicate that a density is more flat around its center than
the density of a normal curve. This measure, however, suffers from the same
failing as does the measure of skewness; namely, it does not always measure
what it is supposed to.

While a particular moment or a few of the moments may give little
information about a distribution (see Fig. 11 for a sketch of two densities having
the same first four moments. See Ref. 40. Also see Prob. 30 in Chap. III),
the entire set of moments ($\mu'_1, \mu'_2, \mu'_3, \ldots$) will ordinarily determine the distri-
we shall see, but the third and higher moments are rarely useful. Ordinarily
one does not know what distribution function one is working with in a practical
problem, and often it makes little difference what the actual shape of the
distribution is. But it is usually necessary to know at least the location of the
distribution and to have some idea of its dispersion. These characteristics can 
be estimated by examining a sample drawn from a set of objects known to have 
the distribution in question. This estimation problem is probably the most 
important problem in applied statistics, and a large part of this book will be 
devoted to a study of it. 

We now define another kind of moment, factorial moment. 

Definition 19 Factorial moment If X is a random variable, the rth
factorial moment of X is defined as (r is a positive integer):
$$\mathscr{E}[X(X - 1) \cdots (X - r + 1)]. \tag{18}$$
////

For some random variables (usually discrete), factorial moments are
easier to calculate than raw moments. However, the raw moments can be
obtained from the factorial moments and vice versa.

The moments of a density function play an important role in theoretical
and applied statistics. In fact, in some cases, if all the moments are known,
the density can be determined. This will be discussed briefly at the end of
this subsection. Since the moments of a density are important, it would be
useful if a function could be found that would give us a representation of all
the moments. Such a function is called a moment generating function.


Definition 20 Moment generating function Let X be a random variable
with density $f_X(\cdot)$. The expected value of $e^{tX}$ is defined to be the
moment generating function of X if the expected value exists for every
value of t in some interval -h < t < h, h > 0. The moment generating
function, denoted by $m_X(t)$ or m(t), is
$$m(t) = \mathscr{E}[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} f_X(x)\,dx \tag{19}$$
if the random variable X is continuous and is
$$m(t) = \mathscr{E}[e^{tX}] = \sum_x e^{tx} f_X(x)$$
if the random variable is discrete. ////

One might note that a moment generating function is defined in terms of
a density function, and since density functions were defined without reference
to random variables (see Definitions 6 and 9), a moment generating function
can be discussed without reference to random variables.

If a moment generating function exists, then $m(t)$ is continuously
differentiable in some neighborhood of the origin. If we differentiate the moment
generating function r times with respect to t, we have

$\frac{d^r}{dt^r} m(t) = \int_{-\infty}^{\infty} x^r e^{xt} f_X(x)\,dx,$ (20)

and letting $t \to 0$, we find

$\frac{d^r}{dt^r} m(0) = \mathscr{E}[X^r] = \mu'_r,$ (21)

where the symbol on the left is to be interpreted to mean the rth derivative of
$m(t)$ evaluated as $t \to 0$. Thus the moments of a distribution may be obtained
from the moment generating function by differentiation, hence its name.

If in Eq. (19) we replace $e^{xt}$ by its series expansion, we obtain the series
expansion of $m(t)$ in terms of the moments of $f_X(\cdot)$; thus

$m(t) = \mathscr{E}\left[1 + Xt + \frac{1}{2!}(Xt)^2 + \frac{1}{3!}(Xt)^3 + \cdots\right]
= 1 + \mu'_1 t + \frac{1}{2!}\mu'_2 t^2 + \cdots
= \sum_{j=0}^{\infty} \frac{1}{j!}\mu'_j t^j,$ (22)

from which it is again evident that $\mu'_r$ may be obtained from $m(t)$; $\mu'_r$ is the
coefficient of $t^r/r!$.


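EXAMPLE 17 Let X be a random variable with probability density function
given by $f_X(x) = \lambda e^{-\lambda x} I_{[0,\infty)}(x)$.

$m_X(t) = \mathscr{E}[e^{tX}] = \int_0^{\infty} e^{tx}\lambda e^{-\lambda x}\,dx = \frac{\lambda}{\lambda - t}$ for $t < \lambda$.

$m'(t) = \frac{dm(t)}{dt} = \frac{\lambda}{(\lambda - t)^2}$; hence $m'(0) = \mathscr{E}[X] = \frac{1}{\lambda}$.

And $m''(t) = \frac{2\lambda}{(\lambda - t)^3}$, so $m''(0) = \mathscr{E}[X^2] = \frac{2}{\lambda^2}$. ////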
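
The differentiation in Example 17 can be imitated numerically. The sketch below is a rough check, not part of the text: it assumes the closed form $\lambda/(\lambda - t)$ for the moment generating function of the exponential density, and it approximates the first two derivatives at $t = 0$ by central differences. The step size `h` and the value `lam = 2.0` are arbitrary choices for the illustration.

```python
def mgf_exponential(t, lam):
    """m(t) = lam / (lam - t), valid only for t < lam."""
    if t >= lam:
        raise ValueError("the moment generating function exists only for t < lam")
    return lam / (lam - t)


def first_derivative_at_zero(m, h=1e-4):
    """Central-difference approximation to m'(0)."""
    return (m(h) - m(-h)) / (2 * h)


def second_derivative_at_zero(m, h=1e-4):
    """Central-difference approximation to m''(0)."""
    return (m(h) - 2 * m(0.0) + m(-h)) / (h * h)


lam = 2.0  # illustrative parameter value


def m(t):
    return mgf_exponential(t, lam)


mean_estimate = first_derivative_at_zero(m)            # close to 1/lam
second_moment_estimate = second_derivative_at_zero(m)  # close to 2/lam**2
```

The two estimates agree with $1/\lambda$ and $2/\lambda^2$ up to the discretization error of the difference quotients.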


EXAMPLE 18 Consider the random variable X having probability density
function $f_X(x) = x^{-2} I_{[1,\infty)}(x)$. (See Example 13.) If the moment
generating function of X exists, then it is given by $\int_1^{\infty} x^{-2} e^{tx}\,dx$. It can be
shown, however, that the integral does not exist for any $t > 0$, and hence
the moment generating function does not exist for this random variable X.

////

As with moments, there is also a generating function for factorial moments. 

Definition 21 Factorial moment generating function Let X be a random
variable. The factorial moment generating function is defined as
$\mathscr{E}[t^X]$ if this expectation exists. ////

The factorial moment generating function is used to generate factorial
moments in the same way as the raw moments are obtained from $\mathscr{E}[e^{tX}]$ except
that t approaches 1 instead of 0. It sometimes simplifies finding moments
of discrete distributions.

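EXAMPLE 19 Suppose X has a discrete density function given by

$f_X(x) = \frac{e^{-\lambda}\lambda^x}{x!}$ for x = 0, 1, 2, ....

Then

$\mathscr{E}[t^X] = \sum_{x=0}^{\infty} \frac{e^{-\lambda}(\lambda t)^x}{x!} = e^{-\lambda}e^{\lambda t};$

hence

$\frac{d}{dt}\mathscr{E}[t^X]\Big|_{t=1} = \lambda e^{-\lambda}e^{\lambda t}\Big|_{t=1} = \lambda.$ ////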

In addition to raw moments, central moments, and factorial moments,
there are other kinds of moments, called cumulants, or semi-invariants.
Cumulants will be defined in terms of the cumulant generating function. We will not
make use of cumulants in this book.

Definition 22 Cumulant and cumulant generating function The logarithm
of the moment generating function of X is defined to be the cumulant
generating function of X. The rth cumulant of X, denoted by $\kappa_r(X)$ or $\kappa_r$,
is the coefficient of $t^r/r!$ in the Taylor series expansion of the cumulant
generating function. ////

A moment generating function is used, as its name suggests, to generate
moments. That, however, will not be its only use for us. An important use
will be in determining distributions.

Theorem 7 Let X and Y be two random variables with densities $f_X(\cdot)$
and $f_Y(\cdot)$, respectively. Suppose that $m_X(t)$ and $m_Y(t)$ both exist and are
equal for all t in the interval $-h < t < h$ for some $h > 0$. Then the two
cumulative distribution functions $F_X(\cdot)$ and $F_Y(\cdot)$ are equal. ////


A proof of the above theorem can be obtained using certain transform
theory that is beyond the scope of this book. We should note, however, what
the theorem asserts. It says that if we can find the moment generating function
of a random variable, then, theoretically, we can find the distribution of the
random variable since there is a unique distribution function for a given moment
generating function. This theorem will prove to be extremely useful in finding
the distribution of certain functions of random variables. In particular, see
Sec. 4 of Chap. V.

EXAMPLE 20 Suppose that a random variable X has a moment generating
function $m_X(t) = 1/(1 - t)$ for $-1 < t < 1$; then we know that the
density of X is given by $f_X(x) = e^{-x} I_{[0,\infty)}(x)$ since we showed in Example
17 above that $\lambda e^{-\lambda x} I_{[0,\infty)}(x)$ has $\lambda/(\lambda - t)$ for its moment generating
function. ////


Problem of moments We have seen that a density function determines a set
of moments $\mu'_1, \mu'_2, \ldots$ when they exist. One of the important problems in
theoretical statistics is this: Given a set of moments, what is the density function 
from which these moments came, and is there only one density function that 
has these particular moments? We shall give only partial answers. First, 
there exists a sequence of moments for which there is an infinite (nondenumer- 
able) collection of different distribution functions having these same moments. 
In general, a sequence of moments $\mu'_1, \mu'_2, \ldots$ does not determine a unique
distribution function. However, we did see that if the moment generating 
function of a random variable did exist, then this moment generating function 
did uniquely determine the corresponding distribution function. (See Theorem 
7 above.) Hence, there are conditions (existence of the moment generating
function is a sufficient condition) under which a sequence of moments does 
uniquely determine a distribution function. The general problem of whether or 
not a distribution function is determined by its sequence of moments is 
referred to as the problem of moments and will not be discussed further. 


PROBLEMS 

1 (a) Show that the following are probability density functions (p.d.f.'s):

$f_1(x) = e^{-x} I_{(0,\infty)}(x)$

$f_2(x) = 2e^{-2x} I_{[0,\infty)}(x)$

$f(x) = (\theta + 1) f_1(x) - \theta f_2(x), \quad 0 < \theta < 1.$

(b) Prove or disprove: If $f_1(x)$ and $f_2(x)$ are p.d.f.'s and if $\theta_1 + \theta_2 = 1$, then
$\theta_1 f_1(x) + \theta_2 f_2(x)$ is a p.d.f.

2 Show that the following is a density function and find its median: 


a 2 (a 4 * 2a:) x(2a + x) r n 

fM = a(a + *) J h °’ “ ° 

3 Find the constant K so that the following is a p.d.f. 
4 Suppose that the cumulative distribution function (c.d.f.) $F_X(x)$ can be written
as a function of $(x - \alpha)/\beta$, where $\alpha$ and $\beta > 0$ are constants; that is, x, $\alpha$, and $\beta$
appear in $F_X(\cdot)$ only in the indicated form.

(a) Prove that if $\alpha$ is increased by $\Delta\alpha$, then so is the mean of X.
(b) Prove that if $\beta$ is multiplied by k (k > 0), then so is the standard deviation
of X.

5 The experiment is to toss two balls into four boxes in such a way that each ball
is equally likely to fall in any box. Let X denote the number of balls in the first
box.

(a) What is the c.d.f. of X?

(b) What is the density function of X?

(c) Find the mean and variance of X.

6 A fair coin is tossed until a head appears. Let X denote the number of tosses
required.

(a) Find the density function of X.

(b) Find the mean and variance of X.

(c) Find the moment generating function (m.g.f.) of X.

*7 A has two pennies; B has one. They match pennies until one of them has all
three. Let X denote the number of trials required to end the game.

(a) What is the density function of X?

(b) Find the mean and variance of X.

(c) What is the probability that B wins the game?

8 Let $f_X(x) = (1/\beta)[1 - |(x - \alpha)/\beta|] I_{(\alpha-\beta,\,\alpha+\beta)}(x)$, where $\alpha$ and $\beta$ are fixed
constants satisfying $-\infty < \alpha < \infty$ and $\beta > 0$.

(a) Demonstrate that $f_X(\cdot)$ is a p.d.f., and sketch it.

(b) Find the c.d.f. corresponding to $f_X(\cdot)$.

(c) Find the mean and variance of X.

(d) Find the qth quantile of X.

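9 Let $f_X(x) = k(1/\beta)\{1 - [(x - \alpha)/\beta]^2\} I_{(\alpha-\beta,\,\alpha+\beta)}(x)$, where $-\infty < \alpha < \infty$ and $\beta > 0$.

(a) Find k so that $f_X(\cdot)$ is a p.d.f., and sketch the p.d.f.

(b) Find the mean, median, and variance of X.

(c) Find $\mathscr{E}[|X - \alpha|]$.

(d) Find the qth quantile of X.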

10 Let $f_X(x) = \tfrac{1}{2}\{\theta I_{(0,1)}(x) + I_{[1,2]}(x) + (1 - \theta) I_{(2,3)}(x)\}$, where $\theta$ is a fixed constant
satisfying $0 \le \theta \le 1$.

(a) Find the c.d.f. of X.

(b) Find the mean, median, and variance of X.

11 Let $f(x; \theta) = \theta f(x; 1) + (1 - \theta) f(x; 0)$, where $\theta$ is a fixed constant satisfying
$0 \le \theta \le 1$. Assume that $f(\cdot; 0)$ and $f(\cdot; 1)$ are both p.d.f.'s.

(a) Show that $f(\cdot; \theta)$ is also a p.d.f.

(b) Find the mean and variance of $f(\cdot; \theta)$ in terms of the mean and variance of
$f(\cdot; 0)$ and $f(\cdot; 1)$, respectively.

(c) Find the m.g.f. of $f(\cdot; \theta)$ in terms of the m.g.f.'s of $f(\cdot; 0)$ and $f(\cdot; 1)$.

12 A bombing plane flies directly above a railroad track. Assume that if a large
(small) bomb falls within 40 (15) feet of the track, the track will be sufficiently
damaged so that traffic will be disrupted. Let X denote the perpendicular
distance from the track that a bomb falls. Assume that

$f_X(x) = \frac{100 - x}{5000} I_{[0,100)}(x).$

(a) Find the probability that a large bomb will disrupt traffic.

(b) If the plane can carry three large (eight small) bombs and uses all three
(eight), what is the probability that traffic will be disrupted?

13 (a) Let X be a random variable with mean $\mu$ and variance $\sigma^2$. Show that
$\mathscr{E}[(X - b)^2]$, as a function of b, is minimized when $b = \mu$.

*(b) Let X be a continuous random variable with median m. Minimize $\mathscr{E}[|X - b|]$
as a function of b. Hint: Show that $\mathscr{E}[|X - b|] = \mathscr{E}[|X - m|] + 2\int_b^m (x - b) f_X(x)\,dx$.

14 (a) If X is a random variable such that $\mathscr{E}[X] = 3$ and $\mathscr{E}[X^2] = 13$, use the
Chebyshev inequality to determine a lower bound for $P[-2 < X < 8]$.

(b) Let X be a discrete random variable with density

$f_X(x) = \tfrac{1}{8} I_{\{-1\}}(x) + \tfrac{6}{8} I_{\{0\}}(x) + \tfrac{1}{8} I_{\{1\}}(x).$

For k = 2 evaluate $P[|X - \mu_X| \ge k\sigma_X]$. (This shows that in general the
Chebyshev inequality cannot be improved.)

(c) If X is a random variable with $\mathscr{E}[X] = \mu$ satisfying $P[X \le 0] = 0$, show that
$P[X > 2\mu] \le \tfrac{1}{2}$.

15 Let X be a random variable with p.d.f. given by


fx(x) — 1 1 — ^r| /to. jjto. 

Find the mean and variance of X.

16 Let X be a random variable having c.d.f.

$F_X(x) = pH(x) + (1 - p)G(x),$

where p is a fixed real number satisfying $0 < p < 1$,

//to =xl l0 . uto 4* A,. x,toi 


and 

G(x ) « Ia7 ( o. 2 ]W + /(j,«)to* 

to Sketch Fxto for p ~ 

(b) Give a formula for the p.d.f. of X or the discrete density function of X , 
whichever is appropriate. 

(c) Evaluate P[X<Zl\X£ll 

17 Does there exist a random variable X for which $P[\mu_X - 2\sigma_X \le X \le \mu_X + 2\sigma_X] = .6$?

18 An urn contains balls numbered 1, 2, 3. First a ball is drawn from the urn,
and then a fair coin is tossed the number of times as the number shown on the
drawn ball. Find the expected number of heads.

19 If X has distribution given by $P[X = 0] = P[X = 2] = p$ and $P[X = 1] = 1 - 2p$
for $0 \le p \le \tfrac{1}{2}$, for what p is the variance of X a maximum?

20 If X is a random variable for which $P[X \le 0] = 0$ and $\mathscr{E}[X] = \mu < \infty$, prove that
$P[X \le \mu t] \ge 1 - 1/t$ for every $t \ge 1$.

21 Given the c d f 

F x {x) «=>D for x < 0 

=»A* + 2 for0£x< J 
= x for 5;£v<I 

*» 1 for J ^ x 

(a) Express F x (x) in terms of indicator functions 
(6) Express P,(x) m the form 

ctF~(x) + bF*(x) 

where f"( ) is an absolutely continuous c d f and F*( ) is a discrete 1 6 f 

(c) Find P[ 25 < Af < 35] 

(d) Find PI 25 < X < 5] 

22 Let/fx) - Ke-’i\-e ■•)?,* «,<*) 

(o) Find K such that /( > is a density function 

(b) Find the corresponding c d f 

(c) Find PI AT >1] 

23 A coin is tossed four times. Let X denote the number of times a head is followed
immediately by a tail. Find the distribution, mean, and variance of X.

24 Let /»(r, (?) =* {8x + l)f,.» n (x) where 6 is a constant 

(a) For what range of values of $\theta$ is $f_X(\cdot; \theta)$ a density function?

(b) Find the mean and median of X.

(c) For what values of $\theta$ is var [X] maximized?

25 Let X be a discrete random variable with the nonnegative integers as values.
Note that $\mathscr{E}[t^X] = \sum_{j=0}^{\infty} t^j P[X = j]$. Hence $\mathscr{E}[t^X]$ is a probability generating function
of X, inasmuch as the coefficient of $t^j$ gives $P[X = j]$. Find $\mathscr{E}[t^X]$ for the random
variable of Probs. 6 and 7.
SPECIAL PARAMETRIC FAMILIES OF
UNIVARIATE DISTRIBUTIONS 


1 INTRODUCTION AND SUMMARY 

The purpose of this chapter is to present certain parametric families of univariate
density functions that have standard names. A parametric family of density
functions is a collection of density functions that is indexed by a quantity called
a parameter. For example, let $f(x; \lambda) = \lambda e^{-\lambda x} I_{(0,\infty)}(x)$, where $\lambda > 0$; then for
each $\lambda > 0$, $f(\cdot; \lambda)$ is a probability density function. $\lambda$ is the parameter, and as
$\lambda$ ranges over the positive numbers, the collection $\{f(\cdot; \lambda): \lambda > 0\}$ is a parametric
family of density functions.

The chapter consists of three main sections: parametric families of
discrete densities are given in one; parametric families of probability density
functions are given in another; and comments relating the two are given in the final
section. For most of the families of distributions introduced, the means,
variances, and moment generating functions are presented; also, a sketch of
several representative members of a presented family is often included. A
table summarizing results of Secs. 2 and 3 is given in Appendix B.

2 DISCRETE DISTRIBUTIONS

In this section we list several parametric families of univariate discrete densities.
Sketches of most are given; the mean and variance of each are derived, and usually
examples of random experiments for which the defined parametric family
might provide a realistic model are included.

The parameter (or parameters) indexes the family of densities. For
each family of densities that is presented, the values that the parameter can
assume will be specified. There is no uniform notation for parameters; both
Greek and Latin letters are used to designate them.

2.1 Discrete Uniform Distribution

Definition 1 Discrete uniform distribution Each member of the family
of discrete density functions

$f(x) = f(x; N) = \begin{cases} \frac{1}{N} & \text{for } x = 1, 2, \ldots, N \\ 0 & \text{otherwise} \end{cases} = \frac{1}{N} I_{\{1,2,\ldots,N\}}(x),$ (1)

where the parameter N ranges over the positive integers, is defined to have
a discrete uniform distribution. A random variable X having a density
given in Eq. (1) is called a discrete uniform random variable. ////

FIGURE 1
Density of discrete uniform.


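Theorem 1 If X has a discrete uniform distribution, then

$\mathscr{E}[X] = \frac{N+1}{2}, \quad \text{var}[X] = \frac{N^2 - 1}{12}, \quad \text{and} \quad m_X(t) = \mathscr{E}[e^{tX}] = \sum_{j=1}^{N} e^{jt}\frac{1}{N}.$

PROOF

$\mathscr{E}[X] = \sum_{j=1}^{N} j\,\frac{1}{N} = \frac{N+1}{2}.$

$\text{var}[X] = \mathscr{E}[X^2] - (\mathscr{E}[X])^2 = \sum_{j=1}^{N} j^2\,\frac{1}{N} - \left(\frac{N+1}{2}\right)^2
= \frac{N(N+1)(2N+1)}{6N} - \frac{(N+1)^2}{4} = \frac{(N+1)(N-1)}{12}.$

$\mathscr{E}[e^{tX}] = \sum_{j=1}^{N} e^{jt}\frac{1}{N}.$

////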
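
The formulas of Theorem 1 can be confirmed by direct summation for a few values of N. The sketch below is an illustration only; the function name `discrete_uniform_moments` and the particular values of N are assumptions made for the example. Exact rational arithmetic is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction


def discrete_uniform_moments(N):
    """Mean and variance of the density 1/N on {1, 2, ..., N}, by direct summation."""
    p = Fraction(1, N)
    values = range(1, N + 1)
    mean = sum(j * p for j in values)
    second_moment = sum(j * j * p for j in values)
    return mean, second_moment - mean**2


# Compare with (N + 1)/2 and (N^2 - 1)/12 for a few illustrative N.
for N in (1, 2, 6, 10):
    mean, var = discrete_uniform_moments(N)
    assert mean == Fraction(N + 1, 2)
    assert var == Fraction(N * N - 1, 12)
```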


Remark The discrete uniform distribution is sometimes defined in
density form as $f(x; N) = [1/(N + 1)] I_{\{0,1,\ldots,N\}}(x)$, for N a nonnegative
integer. If such is the case, the formulas for the mean and variance have
to be modified accordingly. ////


2.2 Bernoulli and Binomial Distributions 


Definition 2 Bernoulli distribution A random variable X is defined
to have a Bernoulli distribution if the discrete density function of X is
given by

$f_X(x) = f_X(x; p) = \begin{cases} p^x(1-p)^{1-x} & \text{for } x = 0 \text{ or } 1 \\ 0 & \text{otherwise} \end{cases} = p^x(1-p)^{1-x} I_{\{0,1\}}(x),$ (2)

where the parameter p satisfies $0 < p < 1$. $1 - p$ is often denoted by q.

////

FIGURE 2
Bernoulli density.

Theorem 2 If X has a Bernoulli distribution, then

$\mathscr{E}[X] = p, \quad \text{var}[X] = pq, \quad \text{and} \quad m_X(t) = pe^t + q.$ (3)

PROOF $\mathscr{E}[X] = 0 \cdot q + 1 \cdot p = p.$

$\text{var}[X] = \mathscr{E}[X^2] - (\mathscr{E}[X])^2 = 0^2 \cdot q + 1^2 \cdot p - p^2 = pq.$

$m_X(t) = \mathscr{E}[e^{tX}] = q + pe^t.$ ////

EXAMPLE 1 A random experiment whose outcomes have been classified
into two categories, called "success" and "failure," represented by the
letters s and f, respectively, is called a Bernoulli trial. If a random
variable X is defined as 1 if a Bernoulli trial results in success and 0 if
the same Bernoulli trial results in failure, then X has a Bernoulli
distribution with parameter p = P[success]. ////

EXAMPLE 2 For a given arbitrary probability space $(\Omega, \mathscr{A}, P[\cdot])$ and for A
belonging to $\mathscr{A}$, define the random variable X to be the indicator function
of A; that is, $X(\omega) = I_A(\omega)$; then X has a Bernoulli distribution with
parameter $p = P[X = 1] = P[A]$. ////

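Definition 3 Binomial distribution A random variable X is defined to
have a binomial distribution if the discrete density function of X is given by

$f_X(x) = f_X(x; n, p) = \begin{cases} \binom{n}{x} p^x q^{n-x} & \text{for } x = 0, 1, \ldots, n \\ 0 & \text{otherwise} \end{cases} = \binom{n}{x} p^x q^{n-x} I_{\{0,1,\ldots,n\}}(x),$ (4)

where the two parameters n and p satisfy $0 < p < 1$, n ranges over the
positive integers, and $q = 1 - p$. A distribution defined by the density
function given in Eq. (4) is called a binomial distribution. ////

FIGURE 3
Binomial densities.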

Theorem 3 If X has a binomial distribution, then

$\mathscr{E}[X] = np, \quad \text{var}[X] = npq, \quad \text{and} \quad m_X(t) = (q + pe^t)^n.$ (5)

PROOF

$m_X(t) = \mathscr{E}[e^{tX}] = \sum_{x=0}^{n} e^{tx}\binom{n}{x} p^x q^{n-x} = \sum_{x=0}^{n} \binom{n}{x} (pe^t)^x q^{n-x} = (pe^t + q)^n.$

Now

$m'_X(t) = npe^t(pe^t + q)^{n-1}$

and

$m''_X(t) = n(n-1)(pe^t)^2(pe^t + q)^{n-2} + npe^t(pe^t + q)^{n-1};$

hence

$\mathscr{E}[X] = m'_X(0) = np$

and

$\text{var}[X] = \mathscr{E}[X^2] - (\mathscr{E}[X])^2 = m''_X(0) - (np)^2 = n(n-1)p^2 + np - (np)^2 = np(1 - p).$ ////
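
The three results of Theorem 3 can be compared with direct sums over the binomial density. The following sketch is illustrative only: the helper names, the parameter values n = 10 and p = 0.3, and the evaluation point t = 0.3 are assumptions chosen for the example.

```python
import math


def binomial_pmf(x, n, p):
    """Binomial density C(n, x) p^x q^(n-x) for x = 0, 1, ..., n."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def binomial_summary(n, p, t=0.3):
    """Return mean, variance, the m.g.f. by direct summation, and the closed form."""
    xs = range(n + 1)
    mean = sum(x * binomial_pmf(x, n, p) for x in xs)
    second_moment = sum(x * x * binomial_pmf(x, n, p) for x in xs)
    mgf_direct = sum(math.exp(t * x) * binomial_pmf(x, n, p) for x in xs)
    mgf_closed = (1 - p + p * math.exp(t)) ** n  # (q + p e^t)^n
    return mean, second_moment - mean**2, mgf_direct, mgf_closed


mean, var, mgf_direct, mgf_closed = binomial_summary(10, 0.3)
# mean is close to np = 3.0 and var is close to npq = 2.1
```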

Remark The binomial distribution reduces to the Bernoulli distribution
when n = 1. Sometimes the Bernoulli distribution is called the point
binomial. ////


EXAMPLE 3 Consider a random experiment consisting of n repeated
independent Bernoulli trials where p is the probability of success s at each
individual trial. The term "repeated" is used to indicate that the
probability of s remains the same from trial to trial. The sample space for
such a random experiment can be represented as follows:

$\Omega = \{(z_1, z_2, \ldots, z_n): z_i = s \text{ or } z_i = f\}.$

$z_i$ indicates the result of the ith trial. Since the trials are independent,
the probability of any specified outcome, say $\{(f, f, s, f, s, s, \ldots, f, s)\}$,
is given by $qqpqpp \cdots qp$. Let the random variable X represent the
number of successes in the n repeated independent Bernoulli trials. Now
$P[X = x] = P[\text{exactly } x \text{ successes and } n - x \text{ failures in } n \text{ trials}] = \binom{n}{x} p^x q^{n-x}$
for x = 0, 1, ..., n since each outcome of the experiment that has
exactly x successes has probability $p^x q^{n-x}$ and there are $\binom{n}{x}$ such outcomes.
Hence X has a binomial distribution.

////


EXAMPLE 4 Consider sampling with replacement from an urn containing
M balls, K of which are defective. Let X represent the number of
defective balls in a sample of size n. The individual draws are Bernoulli trials
where "defective" corresponds to "success," and the experiment of
taking a sample of size n with replacement consists of n repeated
independent Bernoulli trials where p = P[success] = K/M; so X has the
binomial distribution

$\binom{n}{x}\left(\frac{K}{M}\right)^x\left(1 - \frac{K}{M}\right)^{n-x}$ for x = 0, 1, ..., n, (6)

which is the same as $P[A_k]$ in Eq. (3) of Subsec. 3.5 of Chap. I, for x = k.

////


The sketches in Fig. 3 seem to indicate that the terms $f_X(x; n, p)$ increase
monotonically and then decrease monotonically. The following theorem states
that such is indeed the case.


Theorem 4 Let X have a binomial distribution with density $f_X(x; n, p)$;
then $f_X(x - 1; n, p) < f_X(x; n, p)$ for $x < (n + 1)p$; $f_X(x - 1; n, p) >
f_X(x; n, p)$ for $x > (n + 1)p$; and $f_X(x - 1; n, p) = f_X(x; n, p)$ if $x = (n + 1)p$
and $(n + 1)p$ is an integer, where x ranges over 1, ..., n.

PROOF

$\frac{f_X(x; n, p)}{f_X(x - 1; n, p)} = \frac{n - x + 1}{x}\cdot\frac{p}{q} = 1 + \frac{(n + 1)p - x}{xq},$

which is greater than 1 if $x < (n + 1)p$, smaller than 1 if $x > (n + 1)p$,
and equal to 1 if the integer x should equal $(n + 1)p$. ////
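
The monotonicity asserted in Theorem 4 can be observed directly on a particular binomial density. The sketch below is a small illustration under assumed parameter values (n = 10, p = 3/10, and n = 9, p = 1/2 for the tie case); the function names are hypothetical. Exact fractions are used so that the equality case can be tested without rounding.

```python
import math
from fractions import Fraction


def binomial_pmf(x, n, p):
    """Binomial density; p may be a Fraction so that comparisons are exact."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def rising_values(n, p):
    """Values x in 1..n at which the density increases, i.e. f(x-1) < f(x)."""
    return [x for x in range(1, n + 1)
            if binomial_pmf(x - 1, n, p) < binomial_pmf(x, n, p)]


# Here (n + 1)p = 3.3, so the density should rise exactly for x = 1, 2, 3.
rising = rising_values(10, Fraction(3, 10))

# Here (n + 1)p = 5 is an integer, so f(4) and f(5) should coincide.
tie = binomial_pmf(4, 9, Fraction(1, 2)) == binomial_pmf(5, 9, Fraction(1, 2))
```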
2.3 Hypergeometric Distribution

Definition 4 Hypergeometric distribution A random variable X is
defined to have a hypergeometric distribution if the discrete density function
of X is given by

$f_X(x; M, K, n) = \begin{cases} \dfrac{\binom{K}{x}\binom{M-K}{n-x}}{\binom{M}{n}} & \text{for } x = 0, 1, \ldots, n \\ 0 & \text{otherwise} \end{cases} = \dfrac{\binom{K}{x}\binom{M-K}{n-x}}{\binom{M}{n}} I_{\{0,1,\ldots,n\}}(x),$ (7)

where M is a positive integer, K is a nonnegative integer that is at most M,
and n is a positive integer that is at most M. Any distribution function
defined by the density function given in Eq. (7) above is called a
hypergeometric distribution. ////


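Theorem 5 If X has a hypergeometric distribution, then

$\mathscr{E}[X] = n\cdot\frac{K}{M} \quad \text{and} \quad \text{var}[X] = n\cdot\frac{K}{M}\cdot\frac{M-K}{M}\cdot\frac{M-n}{M-1}.$ (8)

PROOF

$\mathscr{E}[X] = \sum_{x=0}^{n} x\,\frac{\binom{K}{x}\binom{M-K}{n-x}}{\binom{M}{n}}
= n\cdot\frac{K}{M}\sum_{x=1}^{n}\frac{\binom{K-1}{x-1}\binom{M-K}{n-x}}{\binom{M-1}{n-1}}
= n\cdot\frac{K}{M}\sum_{y=0}^{n-1}\frac{\binom{K-1}{y}\binom{M-1-K+1}{n-1-y}}{\binom{M-1}{n-1}}
= n\cdot\frac{K}{M},$

using $\sum_{i=0}^{m}\binom{a}{i}\binom{b}{m-i} = \binom{a+b}{m}$ given in Appendix A.
The variance can be obtained similarly by first evaluating $\mathscr{E}[X(X-1)]$. ////

FIGURE 4
Hypergeometric densities.
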

Remark If we set K/M = p, then the mean of the hypergeometric
distribution coincides with the mean of the binomial distribution, and the
variance of the hypergeometric distribution is $(M - n)/(M - 1)$ times the
variance of the binomial distribution. ////
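
The comparison between the hypergeometric and binomial moments can be verified exactly for a particular urn. The following sketch is illustrative; the function names and the values M = 20, K = 7, n = 5 are assumptions made for the example.

```python
import math
from fractions import Fraction


def hypergeometric_pmf(x, M, K, n):
    """Hypergeometric density C(K,x) C(M-K,n-x) / C(M,n) for x = 0, 1, ..., n."""
    return Fraction(math.comb(K, x) * math.comb(M - K, n - x), math.comb(M, n))


def hypergeometric_mean_var(M, K, n):
    """Mean and variance by direct summation over x = 0, 1, ..., n."""
    probs = {x: hypergeometric_pmf(x, M, K, n) for x in range(n + 1)}
    mean = sum(x * p for x, p in probs.items())
    second_moment = sum(x * x * p for x, p in probs.items())
    return mean, second_moment - mean**2


M, K, n = 20, 7, 5                 # illustrative urn: 20 balls, 7 defective, sample of 5
mean, var = hypergeometric_mean_var(M, K, n)

p = Fraction(K, M)
binomial_mean = n * p              # mean under sampling with replacement
binomial_var = n * p * (1 - p)     # variance under sampling with replacement
correction = Fraction(M - n, M - 1)
```

In this instance `mean` equals `binomial_mean`, and `var` equals `binomial_var * correction`, in agreement with the Remark.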
EXAMPLE 5 Let X denote the number of defectives in a sample of size n
when sampling is done without replacement from an urn containing M
balls, K of which are defective. Then X has a hypergeometric distribution.
See Eq. (5) of Subsec. 3.5 in Chap. I. ////

2.4 Poisson Distribution

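Definition 5 Poisson distribution A random variable X is defined to
have a Poisson distribution if the density of X is given by

$f_X(x) = f_X(x; \lambda) = \begin{cases} \dfrac{e^{-\lambda}\lambda^x}{x!} & \text{for } x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases} = \dfrac{e^{-\lambda}\lambda^x}{x!} I_{\{0,1,\ldots\}}(x),$ (9)

where the parameter $\lambda$ satisfies $\lambda > 0$. The density given in Eq. (9) is
called a Poisson density. ////

Theorem 6 Let X be a Poisson distributed random variable; then

$\mathscr{E}[X] = \lambda, \quad \text{var}[X] = \lambda, \quad \text{and} \quad m_X(t) = e^{\lambda(e^t - 1)}.$ (10)

FIGURE 5
Poisson densities.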
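PROOF

$m_X(t) = \mathscr{E}[e^{tX}] = \sum_{x=0}^{\infty} e^{tx}\frac{e^{-\lambda}\lambda^x}{x!} = e^{-\lambda}\sum_{x=0}^{\infty}\frac{(\lambda e^t)^x}{x!} = e^{-\lambda}e^{\lambda e^t};$

hence,

$m'_X(t) = \lambda e^{-\lambda}e^t e^{\lambda e^t}$

and

$m''_X(t) = \lambda e^{-\lambda}e^t e^{\lambda e^t}[\lambda e^t + 1].$

So,

$\mathscr{E}[X] = m'_X(0) = \lambda$

and

$\text{var}[X] = \mathscr{E}[X^2] - (\mathscr{E}[X])^2 = m''_X(0) - \lambda^2 = \lambda[\lambda + 1] - \lambda^2 = \lambda.$ ////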

The Poisson distribution provides a realistic model for many random
phenomena. Since the values of a Poisson random variable are the nonnegative
integers, any random phenomenon for which a count of some sort is of
interest is a candidate for modeling by assuming a Poisson distribution. Such
a count might be the number of fatal traffic accidents per week in a given state,
the number of radioactive particle emissions per unit of time, the number of
telephone calls per hour coming into the switchboard of a large business, the
number of meteorites that collide with a test satellite during a single orbit,
the number of organisms per unit volume of some fluid, the number of defects
per unit of some material, the number of flaws per unit length of some wire,
etc. Naturally, not all counts can be realistically modeled with a Poisson
distribution, but some can; in fact, if certain assumptions regarding the phenomenon
under observation are satisfied, the Poisson model is the correct model.

Let us assume now that we are observing the occurrence of certain
happenings in time, space, region, or length. A happening might be a fatal traffic
accident, a particle emission, the arrival of a telephone call, a meteorite
collision, a defect in an area of material, a flaw in a length of wire, etc. We will
talk as though the happenings are occurring in time, although happenings
occurring in space or length are appropriate as well. The occurrences of the
happening in time could be sketched as in Fig. 6. An occurrence of a
happening is represented by ×; the sketch indicates that seven happenings occurred
between time 0 and time $t_1$. Assume now that there exists a positive quantity,
say v, which satisfies the following:

FIGURE 6


(i) The probability that exactly one happening will occur in a small
time interval of length h is approximately equal to vh, or P[one happening
in interval of length h] = vh + o(h).

(ii) The probability of more than one happening in a small time interval
of length h is negligible when compared to the probability of just one
happening in the same time interval, or P[two or more happenings in
interval of length h] = o(h).

(iii) The numbers of happenings in nonoverlapping time intervals are
independent.

The term o(h), which is read "some function of smaller order than h,"
denotes an unspecified function which satisfies

$\lim_{h \to 0} \frac{o(h)}{h} = 0.$

The quantity v can be interpreted as the mean rate at which happenings occur per
unit of time and is consequently referred to as the mean rate of occurrence.


Theorem 7 If the above three assumptions are satisfied, the number of
occurrences of a happening in a period of time of length t has a Poisson
distribution with parameter $\lambda = vt$. Or if the random variable Z(t)
denotes the number of occurrences of the happening in a time interval
of length t, then $P[Z(t) = z] = e^{-vt}(vt)^z/z!$ for z = 0, 1, 2, ....

We will outline two different proofs, neither of which is mathematically
rigorous.

PROOF For convenience, let t be a point in time after time 0; so the
time interval (0, t] has length t, and the time interval (t, t + h] has length
h. Let $P_n(s) = P[Z(s) = n] = P[\text{exactly } n \text{ happenings in an interval of length } s]$; then

$P_0(t + h) = P[\text{no happenings in interval } (0, t + h]]$
$= P[\text{no happenings in } (0, t] \text{ and no happenings in } (t, t + h]]$
$= P[\text{no happenings in } (0, t]]\,P[\text{no happenings in } (t, t + h]]$
$= P_0(t)P_0(h),$

using (iii), the independence assumption.

Now P[no happenings in (t, t + h]] = 1 - P[one or more happenings
in (t, t + h]] = 1 - P[one happening in (t, t + h]] - P[more than one
happening in (t, t + h]] = 1 - vh - o(h) - o(h); so $P_0(t + h) = P_0(t)[1 - vh - o(h) - o(h)]$, or

$\frac{P_0(t + h) - P_0(t)}{h} = -vP_0(t) - P_0(t)\frac{o(h) + o(h)}{h},$

and on passing to the limit one obtains the differential equation $P'_0(t) = -vP_0(t)$,
whose solution is $P_0(t) = e^{-vt}$, using the condition $P_0(0) = 1$.
Similarly, $P_1(t + h) = P_1(t)P_0(h) + P_0(t)P_1(h)$, or $P_1(t + h) = P_1(t)[1 - vh - o(h)] + P_0(t)[vh + o(h)]$,
which gives the differential equation $P'_1(t) = -vP_1(t) + vP_0(t)$,
the solution of which is given by $P_1(t) = vte^{-vt}$, using
the initial condition $P_1(0) = 0$. Continuing in a similar fashion one
obtains $P'_n(t) = -vP_n(t) + vP_{n-1}(t)$, for n = 2, 3, ....

It is seen that this system of differential equations is satisfied by

$P_n(t) = (vt)^n e^{-vt}/n!.$
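
That the Poisson probabilities satisfy the system of differential equations above can be checked numerically. The sketch below is an illustration only: it replaces the derivative by a central difference, and the function names, the step `h`, and the values v = 0.5 and t = 3 are assumptions made for the example.

```python
import math


def poisson_prob(n, v, t):
    """Candidate solution P_n(t) = (vt)^n e^{-vt} / n!."""
    return (v * t) ** n * math.exp(-v * t) / math.factorial(n)


def ode_residual(n, v, t, h=1e-5):
    """|P_n'(t) - (-v P_n(t) + v P_{n-1}(t))|, with P_n' from a central difference.

    For n = 0 there is no P_{n-1} term, so the equation is P_0'(t) = -v P_0(t).
    """
    derivative = (poisson_prob(n, v, t + h) - poisson_prob(n, v, t - h)) / (2 * h)
    previous = poisson_prob(n - 1, v, t) if n >= 1 else 0.0
    return abs(derivative - (-v * poisson_prob(n, v, t) + v * previous))


v, t = 0.5, 3.0  # illustrative mean rate and time
residuals = [ode_residual(n, v, t) for n in range(6)]
```

Each residual is at the level of the discretization and rounding error, and the initial conditions $P_0(0) = 1$ and $P_1(0) = 0$ hold for `poisson_prob`.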

The second proof can be had by dividing the interval (0, t) into, say,
n time subintervals, each of length h = t/n. The probability that k
happenings occur in the interval (0, t) is approximately equal to the
probability that exactly one happening has occurred in each of k of the n
subintervals that we divided the interval (0, t) into. Now the probability
of a happening, or "success," in a given subinterval is vh. Each
subinterval provides us with a Bernoulli trial; either the subinterval has a
happening or it does not. Also, in view of the assumptions made, these
Bernoulli trials are independent, repeated Bernoulli trials; hence the
probability of exactly k "successes" in the n trials is given by (see
Example 3)

$\binom{n}{k}(vh)^k(1 - vh)^{n-k} = \binom{n}{k}\left[\frac{vt}{n}\right]^k\left[1 - \frac{vt}{n}\right]^{n-k},$

which is an approximation to the desired probability that k happenings will
occur in time interval (0, t). An exact expression can be obtained by
letting the number of subintervals increase to infinity, that is, by letting n
tend to infinity:

$\binom{n}{k}\left[\frac{vt}{n}\right]^k\left[1 - \frac{vt}{n}\right]^{n-k}
= \frac{(vt)^k}{k!}\left[1 - \frac{vt}{n}\right]^{n}\left[1 - \frac{vt}{n}\right]^{-k}\frac{(n)_k}{n^k}
\to \frac{(vt)^k e^{-vt}}{k!}$

as $n \to \infty$, noting that $\left[1 - \frac{vt}{n}\right]^n \to e^{-vt}$, $\left[1 - \frac{vt}{n}\right]^{-k} \to 1$, and $(n)_k/n^k \to 1$,
where $(n)_k = n(n-1)\cdots(n-k+1)$.

////
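
The limiting argument of the second proof can be watched numerically: as the number n of subintervals grows, the binomial probabilities with success probability vt/n approach the Poisson probabilities with parameter vt. The sketch below is illustrative; the helper names, the value vt = 2, and the range of k examined are assumptions made for the example.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Poisson probability e^{-lam} lam^k / k!."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def max_gap(n, lam, kmax=10):
    """Largest |binomial - Poisson| difference over k = 0, 1, ..., kmax."""
    return max(abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam))
               for k in range(kmax + 1))


lam = 2.0  # plays the role of vt
gaps = [max_gap(n, lam) for n in (10, 100, 1000, 10000)]
```

The entries of `gaps` shrink as n increases, which is the numerical counterpart of letting n tend to infinity.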
Theorem 7 gives conditions under which certain random experiments
involving counts of happenings in time (or length, space, area, volume, etc.) can
be realistically modeled by assuming a Poisson distribution. The parameter v
in the Poisson distribution is usually unknown. Techniques for estimating
parameters such as v will be presented in Chap. VII.

In practice great care has to be taken to avoid erroneously applying the 
Poisson distribution to counts. For example, in studying the distribution of 
insect larvae over some crop area, the Poisson model is apt to be invalid since 
insects lay eggs in clusters entailing that larvae are likely to be found in clusters, 
which is inconsistent with the assumption of independence of counts in small 
adjacent subareas. 


EXAMPLE 6 Suppose that the average number of telephone calls arriving
at the switchboard of a small corporation is 30 calls per hour. (i) What
is the probability that no calls will arrive in a 3-minute period? (ii) What
is the probability that more than five calls will arrive in a 5-minute interval?
Assume that the number of calls arriving during any time period has a
Poisson distribution. Assume that time is measured in minutes; then 30
calls per hour is equivalent to .5 calls per minute, so the mean rate of
occurrence is .5 per minute.

$P[\text{no calls in 3-minute period}] = e^{-vt} = e^{-(.5)(3)} = e^{-1.5} \approx .223.$

$P[\text{more than five calls in 5-minute interval}] = \sum_{k=6}^{\infty}\frac{e^{-vt}(vt)^k}{k!}
= \sum_{k=6}^{\infty}\frac{e^{-(.5)(5)}(2.5)^k}{k!} \approx .042.$

////
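
The two probabilities of Example 6 can be reproduced with a few lines of code. This is only a sketch of the arithmetic; the function names are hypothetical, and the tail probability is computed as one minus the sum of the first six Poisson terms.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability e^{-lam} lam^k / k!."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def poisson_cdf(k, lam):
    """P[X <= k] for a Poisson random variable with parameter lam."""
    return sum(poisson_pmf(j, lam) for j in range(k + 1))


rate = 0.5  # calls per minute (30 calls per hour)

p_no_calls = poisson_pmf(0, rate * 3)             # no calls in 3 minutes
p_more_than_five = 1 - poisson_cdf(5, rate * 5)   # more than five calls in 5 minutes
```

Rounded to three decimals, the two values are .223 and .042, as in the example.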


EXAMPLE 7 A merchant knows that the number of a certain kind of item
that he can sell in a given period of time is Poisson distributed. How
many such items should the merchant stock so that the probability will be
.95 that he will have enough items to meet the customer demand for a time
period of length T? Let v denote the mean rate of occurrence per unit
time and K the unknown number of items that the merchant should stock.
Let X denote the number of demands for this kind of item during the time
period of length T. The solution requires finding K so that $P[X \le K] \ge .95$
or finding K so that $\sum_{k=0}^{K} [e^{-vT}(vT)^k/k!] \ge .95$. In particular, if the
merchant sells an average of two such items per day, how many should
he stock so that he will have probability at least .95 of having enough
items to meet demand for a 30-day month? Find K so that

$P[X \le K] \ge .95,$

or find K so that

$\sum_{k=0}^{K}\frac{e^{-(2)(30)}[(2)(30)]^k}{k!} \ge .95.$

The desired K can be found using an appropriate Poisson table (e.g.,
Molina, 1942 [45]). It is K = 73. ////
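
In place of a Poisson table, the stocking level of Example 7 can be found by accumulating Poisson probabilities until the required level is reached. The sketch below is illustrative; the function name `smallest_stock` is hypothetical, and each term is obtained from the previous one through the ratio $\lambda/k$ to avoid large factorials.

```python
import math


def smallest_stock(lam, target=0.95):
    """Smallest K with P[X <= K] >= target, for X Poisson with parameter lam."""
    term = math.exp(-lam)  # P[X = 0]
    cdf = term
    K = 0
    while cdf < target:
        K += 1
        term *= lam / K    # P[X = K] from P[X = K - 1]
        cdf += term
    return K


# Two items per day on average over a 30-day month gives parameter 60.
K = smallest_stock(2 * 30)
```

For the parameter 60 this search returns the value K = 73 quoted in the example.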


EXAMPLE 8 Suppose that flaws in plywood occur at random with an average
of one flaw per 50 square feet. What is the probability that a 4-foot
× 8-foot sheet will have no flaws? At most one flaw? To get a
solution assume that the number of flaws per unit area is Poisson distributed.

$P[\text{no flaws}] = e^{-(1/50)(32)} = e^{-.64} \approx .527.$

$P[\text{at most one flaw}] = e^{-.64} + .64e^{-.64} \approx .865.$ ////

A Poisson density function, like the binomial density, possesses a certain
monotonicity that is precisely stated in the following theorem.


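Theorem 8 Consider the Poisson density

$\frac{e^{-\lambda}\lambda^k}{k!}$ for k = 0, 1, 2, ....

Then

$\frac{e^{-\lambda}\lambda^{k-1}}{(k-1)!} < \frac{e^{-\lambda}\lambda^k}{k!}$ for $k < \lambda$,

$\frac{e^{-\lambda}\lambda^{k-1}}{(k-1)!} > \frac{e^{-\lambda}\lambda^k}{k!}$ for $k > \lambda$,

and

$\frac{e^{-\lambda}\lambda^{k-1}}{(k-1)!} = \frac{e^{-\lambda}\lambda^k}{k!}$ if $\lambda$ is an integer and $k = \lambda$.

PROOF

$\frac{e^{-\lambda}\lambda^{k-1}/(k-1)!}{e^{-\lambda}\lambda^k/k!} = \frac{k}{\lambda},$

which is less than 1 if $k < \lambda$, greater than 1 if $k > \lambda$, and equal to 1 if $\lambda$
is an integer and $k = \lambda$. ////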

2.5 Geometric and Negative Binomial Distributions 

Two other families of discrete distributions that play important roles in statistics
are the geometric (or Pascal) and negative binomial distributions. The reason
that we consider the two together is twofold: first, the geometric distribution is a
special case of the negative binomial distribution, and, second, the sum of
independent and identically distributed geometric random variables is negative
binomially distributed, as we shall see in Chap. V. In Subsec. 3.3 of this chapter,
the exponential and gamma distributions are defined. We shall see that in
several respects the geometric and negative binomial distributions are discrete
analogs of the exponential and gamma distributions.


Definition 6 Geometric distribution A random variable X is defined
to have a geometric (or Pascal) distribution if the density of X is given by

$f_X(x) = f_X(x; p) = \begin{cases} p(1-p)^x & \text{for } x = 0, 1, \ldots \\ 0 & \text{otherwise} \end{cases} = p(1-p)^x I_{\{0,1,\ldots\}}(x),$ (11)

where the parameter p satisfies $0 < p \le 1$. (Define $q = 1 - p$.) ////

Definition 7 Negative binomial distribution A random variable X
with density

$f_X(x) = f_X(x; r, p) = \begin{cases} \binom{r+x-1}{x} p^r q^x & \text{for } x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases} = \binom{r+x-1}{x} p^r q^x I_{\{0,1,\ldots\}}(x),$ (12)

where the parameters r and p satisfy r = 1, 2, 3, ... and $0 < p \le 1$
($q = 1 - p$), is defined to have a negative binomial distribution. The
density given by Eq. (12) is called a negative binomial density.

////

Remark If in the negative binomial distribution r = 1, then the negative
binomial density specializes to the geometric density. ////

FIGURE 7
Geometric densities.


Theorem 9 If the random variable X has a geometric distribution,
then

$$\mathscr{E}[X] = \frac{q}{p}, \quad \operatorname{var}[X] = \frac{q}{p^2}, \quad \text{and} \quad m_X(t) = \frac{p}{1 - qe^t}. \tag{13}$$

PROOF Since a geometric distribution is a special case of a negative
binomial distribution, Theorem 9 is a corollary of Theorem 11. ////

The geometric distribution is well named since the values that the geometric
density assumes are the terms of a geometric series. Also the mode of the
geometric density is necessarily 0. A geometric density possesses one other
interesting property, which is given in the following theorem.


Theorem 10 If X has the geometric density with parameter p, then
$P[X \ge i + j \mid X \ge i] = P[X \ge j]$ for $i, j = 0, 1, 2, \ldots$.

PROOF

$$P[X \ge i + j \mid X \ge i] = \frac{P[X \ge i + j]}{P[X \ge i]} = \frac{\sum_{x=i+j}^{\infty} p(1-p)^x}{\sum_{x=i}^{\infty} p(1-p)^x} = \frac{(1-p)^{i+j}}{(1-p)^{i}} = (1-p)^{j} = P[X \ge j].$$

////


Theorem 10 says that the probability that a geometric random variable is
greater than or equal to i + j given that it is greater than or equal to i is equal to
the unconditional probability that it will be greater than or equal to j. We will
comment on this again in the following example.


EXAMPLE 9 Consider a sequence of independent, repeated Bernoulli trials
with p equal to the probability of success on an individual trial. Let the
random variable X represent the number of trials required before the first
success; then X has the geometric density given by Eq. (11). To see this,
note that the first success will occur on trial x + 1 if this (x + 1)st trial
results in a success and the first x trials resulted in failures; but, by
independence, x successive failures followed by a success has probability
$(1 - p)^x p$. In the language of this example, Theorem 10 states that the
probability that at least i + j trials are required before the first success,
given that there have been i successive failures, is equal to the
unconditional probability that at least j trials are needed before the first
success. That is, the fact that one has already observed i successive failures
does not change the distribution of the number of trials required to obtain
the first success. ////
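
The lack-of-memory property in Theorem 10 can be checked numerically. The following is a minimal sketch, not part of the original text: the function names and the truncation length `terms` are assumptions introduced only for illustration, and the infinite sums are approximated by long finite sums.

```python
def geometric_pmf(x: int, p: float) -> float:
    """Density of Eq. (11): p(1-p)^x for x = 0, 1, 2, ...; 0 otherwise."""
    if x < 0:
        return 0.0
    return p * (1.0 - p) ** x


def tail_prob(k: int, p: float, terms: int = 2000) -> float:
    """Approximate P[X >= k] by summing the density over k, ..., k + terms - 1."""
    return sum(geometric_pmf(x, p) for x in range(k, k + terms))


def conditional_tail(i: int, j: int, p: float) -> float:
    """P[X >= i + j | X >= i], straight from the definition of conditional probability."""
    return tail_prob(i + j, p) / tail_prob(i, p)
```

For any choice of i, `conditional_tail(i, j, p)` should agree with `tail_prob(j, p)`, which in turn equals $(1-p)^j$ up to rounding.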


A random variable X that has a geometric distribution is often referred 
to as a discrete waiting-time random variable. It represents how long (in terms 
of the number of failures) one has to wait for a success. 

Before leaving the geometric distribution, we note that some authors
define the geometric distribution by assuming 1 (instead of 0) is the smallest
mass point. The density then has the form

$$f(x; p) = p(1-p)^{x-1} I_{\{1,2,\ldots\}}(x), \tag{14}$$

and the mean is $1/p$, the variance is $q/p^2$, and the moment generating function is
$pe^t/(1 - qe^t)$.

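Theorem 11 Let X have a negative binomial distribution; then

$$\mathscr{E}[X] = \frac{rq}{p}, \quad \operatorname{var}[X] = \frac{rq}{p^2}, \quad \text{and} \quad m_X(t) = \left[\frac{p}{1 - qe^t}\right]^r. \tag{15}$$

PROOF

$$m_X(t) = \mathscr{E}[e^{tX}] = \sum_{x=0}^{\infty} e^{tx} \binom{-r}{x} p^r (-q)^x = p^r \sum_{x=0}^{\infty} \binom{-r}{x} (-qe^t)^x = \left[\frac{p}{1 - qe^t}\right]^r$$

[see Eq. (33) in Appendix A].

$$m_X'(t) = p^r(-r)(1 - qe^t)^{-r-1}(-qe^t) = rqp^r e^t (1 - qe^t)^{-r-1}$$

and

$$m_X''(t) = rqp^r \left[ q(r+1)e^{2t}(1 - qe^t)^{-r-2} + e^t(1 - qe^t)^{-r-1} \right];$$

hence

$$\mathscr{E}[X] = m_X'(0) = rqp^r p^{-r-1} = \frac{rq}{p}$$

and

$$\operatorname{var}[X] = m_X''(0) - (\mathscr{E}[X])^2 = rqp^r\left[qp^{-r-2}(r+1) + p^{-r-1}\right] - \left(\frac{rq}{p}\right)^2 = \frac{rq^2}{p^2} + \frac{rq}{p} = \frac{rq}{p^2}.$$

////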


The negative binomial distribution, like the Poisson, has the nonnegative
integers for its mass points; hence, the negative binomial distribution is
potentially a model for a random experiment where a count of some sort is of interest.
Indeed, the negative binomial distribution has been applied in population counts,
in health and accident statistics, in communications, and in other counts as
well. Unlike the Poisson distribution, where the mean and variance are the
same, the variance of the negative binomial distribution is greater than its mean.
We will see in Subsec. 4.3 of this chapter that the negative binomial distribution
can be obtained as a contagious distribution from the Poisson distribution.

EXAMPLE 10 Consider a sequence of independent, repeated Bernoulli
trials with p equal to the probability of success on an individual trial. Let
the random variable X represent the number of failures prior to the rth
success; then X has the negative binomial density given by Eq. (12), as the
following argument shows: The last trial must result in a success, having
probability p; among the first x + r - 1 trials there must be r - 1 successes
and x failures, and the probability of this is

$$\binom{r+x-1}{r-1} p^{r-1} q^x = \binom{r+x-1}{x} p^{r-1} q^x,$$

which when multiplied by p gives the desired result. ////
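
The density of Eq. (12) and the moments stated in Theorem 11 can be checked by direct summation. This is a hedged sketch rather than anything from the original text: the helper names and the cutoff `terms` are illustrative assumptions, and the infinite sums are truncated.

```python
from math import comb


def negative_binomial_pmf(x: int, r: int, p: float) -> float:
    """Density of Eq. (12): C(r+x-1, x) p^r q^x for x = 0, 1, 2, ...; 0 otherwise."""
    if x < 0:
        return 0.0
    q = 1.0 - p
    return comb(r + x - 1, x) * p ** r * q ** x


def negative_binomial_summary(r: int, p: float, terms: int = 500):
    """Return (total mass, mean, variance) from the first `terms` mass points."""
    probs = [negative_binomial_pmf(x, r, p) for x in range(terms)]
    total = sum(probs)
    mean = sum(x * f for x, f in enumerate(probs))
    second_moment = sum(x * x * f for x, f in enumerate(probs))
    return total, mean, second_moment - mean ** 2
```

With r = 1 the function reduces to the geometric density of Eq. (11), and for moderate r and p the summary values should be close to 1, rq/p, and rq/p^2 respectively.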


A random variable X having a negative binomial distribution is often 
referred to as a discrete waiting-time random variable. It represents how long 
(in terms of the number of failures) one waits for the rth success. 


EXAMPLE 11 The negative binomial distribution is of importance in the
consideration of inverse binomial sampling. Suppose a proportion p
of individuals in a population possesses a certain characteristic. If
individuals in the population are sampled until exactly r individuals with
the certain characteristic are found, then the number of individuals in
excess of r that are observed or sampled has a negative binomial
distribution. ////


2.6 Other Discrete Distributions 

In the previous five subsections we presented seven parametric families of
univariate discrete density functions. Each is commonly known by the names
given. There are many other families of discrete density functions. In fact,
new families can be formed from the presented families by various processes.
One such process is called truncation. We will illustrate this process by looking
at the Poisson distribution truncated at 0. Suppose, as is sometimes the case, 
that the zero count cannot be observed yet the Poisson distribution seems a 
reasonable model. One might then distribute the mass ordinarily given to the 
mass point 0 proportionately among the other mass points obtaining the family 
of densities 


$$f(x; \lambda) = \begin{cases} \dfrac{e^{-\lambda}\lambda^x}{x!\,(1 - e^{-\lambda})} & \text{for } x = 1, 2, \ldots \\ 0 & \text{otherwise.} \end{cases} \tag{16}$$

A random variable having density given by Eq. (16) is called a Poisson
random variable truncated at 0.

Another process for obtaining a new family of densities from a given
family can also be illustrated with the Poisson distribution. Suppose that a
random variable X, representing a count of some sort, has a Poisson
distribution. If the experimenter is stuck with a rather poor counter, one that cannot
count beyond 2, the random variable that the experimenter actually observes
has density given by

$$f(x) = \begin{cases} e^{-\lambda} & \text{for } x = 0 \\ \lambda e^{-\lambda} & \text{for } x = 1 \\ 1 - e^{-\lambda} - \lambda e^{-\lambda} & \text{for } x = 2 \\ 0 & \text{otherwise.} \end{cases}$$

The counter counts correctly values 0 and 1 of the random variable X, but if X
takes on any value 2 or more, the counter counts 2. Such a random variable
is often referred to as a censored random variable.
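
Both constructions are easy to express in code. The sketch below is illustrative only (the function names are assumptions, not notation from the text): truncation rescales the Poisson mass on 1, 2, ... so that it sums to 1, while censoring lumps all the mass at 2 or more onto the single point 2.

```python
from math import exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """Ordinary Poisson density e^{-lam} lam^x / x! on x = 0, 1, 2, ..."""
    if x < 0:
        return 0.0
    return exp(-lam) * lam ** x / factorial(x)


def truncated_at_zero_pmf(x: int, lam: float) -> float:
    """Eq. (16): Poisson mass on x >= 1, rescaled by 1 / (1 - e^{-lam})."""
    if x < 1:
        return 0.0
    return poisson_pmf(x, lam) / (1.0 - exp(-lam))


def censored_at_two_pmf(x: int, lam: float) -> float:
    """Counter that reads 0 and 1 correctly and reports 2 whenever X >= 2."""
    if x in (0, 1):
        return poisson_pmf(x, lam)
    if x == 2:
        return 1.0 - poisson_pmf(0, lam) - poisson_pmf(1, lam)
    return 0.0
```

Each function should sum to 1 over its own mass points, which is a quick sanity check on the rescaling and the lumping.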

The above two illustrations indicate how other families of discrete densities
can be formulated from existing families. We close this section by giving two
further, not so well known, families of discrete densities.


Definition 8 Beta-binomial distribution The distribution with discrete
density function

$$f(x) = f(x; n, \alpha, \beta) = \binom{n}{x} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \, \frac{\Gamma(x + \alpha)\,\Gamma(n + \beta - x)}{\Gamma(n + \alpha + \beta)} \, I_{\{0,1,\ldots,n\}}(x), \tag{17}$$

where n is a nonnegative integer, $\alpha > 0$, and $\beta > 0$, is defined as the
beta-binomial distribution.

$\Gamma(m)$ is the well-known gamma function $\Gamma(m) = \int_0^{\infty} x^{m-1} e^{-x}\,dx$ for
$m > 0$. See Appendix A. The beta-binomial distribution has

$$\text{Mean} = \frac{n\alpha}{\alpha + \beta} \quad \text{and} \quad \text{variance} = \frac{n\alpha\beta(n + \alpha + \beta)}{(\alpha + \beta)^2(\alpha + \beta + 1)}. \tag{18}$$

It has the same mass points as the binomial distribution. If $\alpha = \beta = 1$,
then the beta-binomial distribution reduces to a discrete uniform
distribution over the integers 0, 1, ..., n. ////

Definition 9 Logarithmic distribution The distribution with discrete
density function

$$f(x; p) = \begin{cases} \dfrac{q^x}{-x \log_e p} & \text{for } x = 1, 2, \ldots \\ 0 & \text{otherwise,} \end{cases} \tag{19}$$

where the parameters satisfy $0 < p < 1$ and $q = 1 - p$, is defined as the
logarithmic distribution. ////


The name is justified if one recalls the power-series expansion of $\log_e(1 - q)$.
The logarithmic distribution has

$$\text{Mean} = \frac{q}{-p \log_e p} \quad \text{and} \quad \text{variance} = \frac{q(q + \log_e p)}{-(p \log_e p)^2}. \tag{20}$$

It can be derived as a limiting distribution of negative binomial distributions
that have been generalized to include r, any positive number (rather than just
an integer), truncated at 0. The limiting distribution is obtained by letting r
approach 0.
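
The moment formulas in Eqs. (18) and (20) can be compared against direct summation of the densities in Eqs. (17) and (19). The sketch below is an illustration under stated assumptions: the function names are hypothetical, `lgamma` is used in place of the gamma function to avoid overflow, and the logarithmic sums are truncated at a finite number of terms.

```python
from math import comb, exp, lgamma, log


def beta_binomial_pmf(x: int, n: int, alpha: float, beta: float) -> float:
    """Density of Eq. (17), with the gamma-function ratio evaluated on the log scale."""
    if x < 0 or x > n:
        return 0.0
    log_gamma_part = (
        lgamma(alpha + beta) - lgamma(alpha) - lgamma(beta)
        + lgamma(x + alpha) + lgamma(n + beta - x) - lgamma(n + alpha + beta)
    )
    return comb(n, x) * exp(log_gamma_part)


def logarithmic_pmf(x: int, p: float) -> float:
    """Density of Eq. (19): q^x / (-x log_e p) for x = 1, 2, ...; 0 otherwise."""
    if x < 1:
        return 0.0
    q = 1.0 - p
    return q ** x / (-x * log(p))


def mean_and_variance(pmf, support):
    """Mean and variance of a discrete density summed over the given mass points."""
    mean = sum(x * pmf(x) for x in support)
    second_moment = sum(x * x * pmf(x) for x in support)
    return mean, second_moment - mean ** 2
```

For example, `mean_and_variance(lambda x: beta_binomial_pmf(x, 10, 2.0, 3.0), range(11))` can be compared with Eq. (18), and setting alpha = beta = 1 should give equal mass 1/(n + 1) at every point.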


3 CONTINUOUS DISTRIBUTIONS 

In this section several parametric families of univariate probability density 
functions are presented. Sketches of some are included; the mean and variance 
(when they exist) of each are given. 


3.1 Uniform or Rectangular Distribution 

A very simple distribution for a continuous random variable is the uniform dis- 
tribution. It is particularly useful in theoretical statistics because it is convenient 
to deal with mathematically. 

Definition 10 Uniform distribution If the probability density function
of a random variable X is given by

$$f_X(x) = f_X(x; a, b) = \frac{1}{b - a} I_{[a,b]}(x), \tag{21}$$

where the parameters a and b satisfy $-\infty < a < b < \infty$, then the random
variable X is defined to be uniformly distributed over the interval [a, b],
and the distribution given by Eq. (21) is called a uniform distribution.
////

FIGURE 8
Uniform probability density.

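Theorem 12 If X is uniformly distributed over [a, b], then

$$\mathscr{E}[X] = \frac{a + b}{2}, \quad \operatorname{var}[X] = \frac{(b - a)^2}{12}, \quad \text{and} \quad m_X(t) = \frac{e^{bt} - e^{at}}{(b - a)t}. \tag{22}$$

PROOF

$$\mathscr{E}[X] = \int_a^b x \frac{1}{b - a}\,dx = \frac{b^2 - a^2}{2(b - a)} = \frac{a + b}{2}.$$

$$\operatorname{var}[X] = \mathscr{E}[X^2] - (\mathscr{E}[X])^2 = \int_a^b x^2 \frac{1}{b - a}\,dx - \left(\frac{a + b}{2}\right)^2 = \frac{b^3 - a^3}{3(b - a)} - \frac{(a + b)^2}{4} = \frac{(b - a)^2}{12}.$$

$$m_X(t) = \mathscr{E}[e^{tX}] = \int_a^b e^{tx} \frac{1}{b - a}\,dx = \frac{e^{bt} - e^{at}}{(b - a)t}.$$

////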

The uniform distribution gets its name from the fact that its density is
uniform, or constant, over the interval [a, b]. It is also called the rectangular
distribution, since the shape of the density is rectangular.

The cumulative distribution function of a uniform random variable is
given by

$$F_X(x) = \left(\frac{x - a}{b - a}\right) I_{[a,b]}(x) + I_{(b,\infty)}(x). \tag{23}$$

It provides a useful model for a few random phenomena. For instance, if it
is known that the values of some random variable X can only be in a finite
interval, say [a, b], and if one assumes that any two subintervals of [a, b] of
equal length have the same probability of containing X, then X has a uniform
distribution over the interval [a, b]. When one speaks of a random number
from the interval [0, 1], one is thinking of the value of a uniformly distributed
random variable over the interval [0, 1].

EXAMPLE 12 If a wheel is spun and then allowed to come to rest, the point 
on the circumference of the wheel that is located opposite a certain fixed 
marker could be considered the value of a random variable X that is 
uniformly distributed over the circumference of the wheel. One could 
then compute the probability that X will fall in any given arc. //// 

Although we defined the uniform distribution as being uniformly
distributed over the closed interval [a, b], one could just as well define it over the
open interval (a, b) [in which case $f_X(x) = (b - a)^{-1} I_{(a,b)}(x)$] or over either of the
half-open-half-closed intervals (a, b] or [a, b). Note that all four of the possible
densities have the same cumulative distribution function. This lack of
uniqueness of probability density functions was first mentioned in Subsec. 3.2 of
Chap. II.


3.2 Normal Distribution 


A great many of the techniques used in applied statistics are based upon the 
normal distribution; it will frequently appear in the remainder of this book. 


Definition 11 Normal distribution A random variable X is defined to
be normally distributed if its density is given by

$$f_X(x) = f_X(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x - \mu)^2/2\sigma^2}, \tag{24}$$

where the parameters $\mu$ and $\sigma$ satisfy $-\infty < \mu < \infty$ and $\sigma > 0$. Any
distribution defined by a density function given in Eq. (24) is called a
normal distribution. ////


We have used the symbols $\mu$ and $\sigma^2$ to represent the parameters because
these parameters turn out, as we shall see, to be the mean and variance,
respectively, of the distribution.

One can readily check that the mode of a normal density occurs at $x = \mu$
and inflection points occur at $\mu - \sigma$ and $\mu + \sigma$. (See Fig. 9.) Since the normal
distribution occurs so frequently in later chapters, special notation is introduced
for it. If random variable X is normally distributed with mean $\mu$ and variance
$\sigma^2$, we will write $X \sim N(\mu, \sigma^2)$. We will also use the notation $\phi_{\mu,\sigma^2}(x)$ for the
density of $X \sim N(\mu, \sigma^2)$ and $\Phi_{\mu,\sigma^2}(x)$ for the cumulative distribution function.
If the normal random variable has mean 0 and variance 1, it is called a
standard or normalized normal random variable. For a standard normal
random variable the subscripts of the density and distribution function notations
are dropped; that is,

$$\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \quad \text{and} \quad \Phi(x) = \int_{-\infty}^{x} \phi(u)\,du. \tag{25}$$

Since $\phi_{\mu,\sigma^2}(x)$ is given to be a density function, it is implied that

$$\int_{-\infty}^{\infty} \phi_{\mu,\sigma^2}(x)\,dx = 1,$$

but we should satisfy ourselves that this is true. The verification is somewhat
troublesome because the indefinite integral of this particular density function
does not have a simple functional expression. Suppose that we represent the
area under the curve by A; then

$$A = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} e^{-(x - \mu)^2/2\sigma^2}\,dx,$$

and on making the substitution $y = (x - \mu)/\sigma$, we find that

$$A = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\,dy.$$

FIGURE 10
Normal cumulative distribution function.

We wish to show that A = 1, and this is most easily done by showing that $A^2$ is
1 and then reasoning that A = 1 since A is positive (its integrand is positive). We may put

$$A^2 = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\,dy \cdot \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-z^2/2}\,dz = \frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(y^2 + z^2)/2}\,dy\,dz$$

by writing the product of two integrals as a double integral. In this integral
we change the variables to polar coordinates by the substitutions

$$y = r \sin\theta, \qquad z = r \cos\theta,$$

and the integral becomes

$$A^2 = \frac{1}{2\pi} \int_0^{\infty} \int_0^{2\pi} r e^{-r^2/2}\,d\theta\,dr = \int_0^{\infty} r e^{-r^2/2}\,dr = 1.$$


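Theorem 13 If X is a normal random variable,

$$\mathscr{E}[X] = \mu, \quad \operatorname{var}[X] = \sigma^2, \quad \text{and} \quad m_X(t) = e^{\mu t + \sigma^2 t^2/2}. \tag{26}$$

PROOF

$$m_X(t) = \mathscr{E}[e^{tX}] = e^{t\mu}\,\mathscr{E}[e^{t(X - \mu)}] = e^{t\mu} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} e^{t(x - \mu)} e^{-(1/2\sigma^2)(x - \mu)^2}\,dx$$

$$= e^{t\mu} \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} e^{-(1/2\sigma^2)[(x - \mu)^2 - 2\sigma^2 t(x - \mu)]}\,dx.$$

If we complete the square inside the bracket, it becomes

$$(x - \mu)^2 - 2\sigma^2 t(x - \mu) = (x - \mu)^2 - 2\sigma^2 t(x - \mu) + \sigma^4 t^2 - \sigma^4 t^2 = (x - \mu - \sigma^2 t)^2 - \sigma^4 t^2,$$

and we have

$$m_X(t) = e^{t\mu} e^{\sigma^2 t^2/2} \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} e^{-(x - \mu - \sigma^2 t)^2/2\sigma^2}\,dx.$$

The integral together with the factor $1/(\sqrt{2\pi}\,\sigma)$ is necessarily 1 since it is the
area under a normal distribution with mean $\mu + \sigma^2 t$ and variance $\sigma^2$.
Hence,

$$m_X(t) = e^{\mu t + \sigma^2 t^2/2}.$$

On differentiating $m_X(t)$ twice and substituting t = 0, we find

$$\mathscr{E}[X] = m_X'(0) = \mu$$

and

$$\operatorname{var}[X] = \mathscr{E}[X^2] - (\mathscr{E}[X])^2 = m_X''(0) - \mu^2 = \sigma^2,$$

thus justifying our use of the symbols $\mu$ and $\sigma^2$ for the parameters. ////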


Since the indefinite integral of $\phi_{\mu,\sigma^2}(x)$ does not have a simple functional
form, one can only exhibit the cumulative distribution function as

$$\Phi_{\mu,\sigma^2}(x) = \int_{-\infty}^{x} \phi_{\mu,\sigma^2}(u)\,du. \tag{27}$$

The following theorem shows that we can find the probability that a normally
distributed random variable, with mean $\mu$ and variance $\sigma^2$, falls in any interval
in terms of the standard normal cumulative distribution function, and this
standard normal cumulative distribution function is tabled in Table 2 of
Appendix D.


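Theorem 14 If $X \sim N(\mu, \sigma^2)$, then

$$P[a < X < b] = \Phi\left(\frac{b - \mu}{\sigma}\right) - \Phi\left(\frac{a - \mu}{\sigma}\right). \tag{28}$$

PROOF

$$P[a < X < b] = \int_a^b \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x - \mu)^2/2\sigma^2}\,dx = \int_{(a - \mu)/\sigma}^{(b - \mu)/\sigma} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz = \Phi\left(\frac{b - \mu}{\sigma}\right) - \Phi\left(\frac{a - \mu}{\sigma}\right).$$

////

Remark $\Phi(x) = 1 - \Phi(-x)$. ////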

The normal distribution appears to be a reasonable model of the behavior 
of certain random phenomena. It also is the limiting form of many other prob- 
ability distributions. Some such limits are given in Subsec. 4.1 of this chapter. 
The normal distribution is also the limiting distribution in the famous central- 
limit theorem , which is discussed in Sec. 4 of Chap. V and again in Sec. 3 of 
Chap. VI. 

Most students are already somewhat familiar with the normal distribution 
because of their experience with “grading on the curve.” This notion is 
covered in the following example. 

EXAMPLE 13 Suppose that an instructor assumes that a student's final
score is the value of a normally distributed random variable. If the
instructor decides to award a grade of A to those students whose score
exceeds $\mu + \sigma$, a B to those students whose score falls between $\mu$ and
$\mu + \sigma$, a C if a score falls between $\mu - \sigma$ and $\mu$, a D if a score falls between
$\mu - 2\sigma$ and $\mu - \sigma$, and an F if the score falls below $\mu - 2\sigma$, then the
proportions of each grade given can be calculated. For example, since

$$P[X > \mu + \sigma] = 1 - P[X \le \mu + \sigma] = 1 - \Phi\left(\frac{\mu + \sigma - \mu}{\sigma}\right) = 1 - \Phi(1) \approx .1587,$$

one would expect 15.87 percent of the students to receive A's. ////

EXAMPLE 14 Suppose that the diameters of shafts manufactured by a
certain machine are normal random variables with mean 10 centimeters and
standard deviation .1 centimeter. If for a given application the shaft must
meet the requirement that its diameter fall between 9.9 and 10.2
centimeters, what proportion of the shafts made by this machine will meet the
requirement?

$$P[9.9 < X < 10.2] = \Phi\left(\frac{10.2 - 10}{.1}\right) - \Phi\left(\frac{9.9 - 10}{.1}\right) = \Phi(2) - \Phi(-1) \approx .9772 - .1587 = .8185.$$

////
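
The calculations in Examples 13 and 14 use only Theorem 14 and a table of $\Phi$. As a rough sketch, and as an alternative to the table in Appendix D, $\Phi$ can be evaluated through the error function via the identity $\Phi(z) = \tfrac{1}{2}[1 + \operatorname{erf}(z/\sqrt{2})]$. The function names below are hypothetical and introduced only for this illustration.

```python
from math import erf, sqrt


def standard_normal_cdf(z: float) -> float:
    """Phi(z), computed from the error function instead of a table."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def normal_interval_prob(a: float, b: float, mu: float, sigma: float) -> float:
    """P[a < X < b] for X ~ N(mu, sigma^2), using Eq. (28)."""
    return standard_normal_cdf((b - mu) / sigma) - standard_normal_cdf((a - mu) / sigma)


def grade_proportions() -> dict:
    """Proportions of A, B, C, D, F under the cutoffs of Example 13 (free of mu and sigma)."""
    phi = standard_normal_cdf
    return {
        "A": 1.0 - phi(1.0),
        "B": phi(1.0) - phi(0.0),
        "C": phi(0.0) - phi(-1.0),
        "D": phi(-1.0) - phi(-2.0),
        "F": phi(-2.0),
    }
```

Calling `normal_interval_prob(9.9, 10.2, 10.0, 0.1)` reproduces the shaft calculation of Example 14 up to table rounding, and the five grade proportions necessarily sum to 1.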




3.3 Exponential and Gamma Distributions

Two other families of distributions that play important roles in statistics are the
(negative) exponential and gamma distributions, which are defined in this
subsection. The reason that the two are considered together is twofold: first, the
exponential is a special case of the gamma, and, second, the sum of independent
identically distributed exponential random variables is gamma-distributed, as
we shall see in Chap. V.

Definition 12 Exponential distribution If a random variable X has a
density given by

$$f_X(x; \lambda) = \lambda e^{-\lambda x} I_{[0,\infty)}(x), \tag{29}$$

where $\lambda > 0$, then X is defined to have an (negative) exponential
distribution. ////

Definition 13 Gamma distribution If a random variable X has density
given by

$$f_X(x; r, \lambda) = \frac{\lambda}{\Gamma(r)} (\lambda x)^{r-1} e^{-\lambda x} I_{[0,\infty)}(x), \tag{30}$$

where $r > 0$ and $\lambda > 0$, then X is defined to have a gamma distribution.
$\Gamma(\cdot)$ is the gamma function, and it is discussed in Appendix A. ////

Remark If in the gamma density r = 1, the gamma density specializes
to the exponential density. ////

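Theorem 15 If X has an exponential distribution, then

$$\mathscr{E}[X] = \frac{1}{\lambda}, \quad \operatorname{var}[X] = \frac{1}{\lambda^2}, \quad \text{and} \quad m_X(t) = \frac{\lambda}{\lambda - t} \quad \text{for } t < \lambda. \tag{31}$$

PROOF The exponential distribution was the distribution used as an
example for some definitions given in Chap. II, and derivations of the above
appear there. Also, Theorem 15 is a corollary to the following theorem.
////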

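Theorem 16 If X has a gamma distribution with parameters r and $\lambda$,
then

$$\mathscr{E}[X] = \frac{r}{\lambda}, \quad \operatorname{var}[X] = \frac{r}{\lambda^2}, \quad \text{and} \quad m_X(t) = \left(\frac{\lambda}{\lambda - t}\right)^r \quad \text{for } t < \lambda. \tag{32}$$

Figure: Gamma densities ($\lambda = 1$).

PROOF

$$m_X(t) = \mathscr{E}[e^{tX}] = \int_0^{\infty} \frac{\lambda^r}{\Gamma(r)} e^{tx} x^{r-1} e^{-\lambda x}\,dx = \left(\frac{\lambda}{\lambda - t}\right)^r \int_0^{\infty} \frac{(\lambda - t)^r}{\Gamma(r)} x^{r-1} e^{-(\lambda - t)x}\,dx = \left(\frac{\lambda}{\lambda - t}\right)^r.$$

$$m_X'(t) = r\lambda^r(\lambda - t)^{-r-1}$$

and

$$m_X''(t) = r(r + 1)\lambda^r(\lambda - t)^{-r-2};$$

hence

$$\mathscr{E}[X] = m_X'(0) = \frac{r}{\lambda}$$

and

$$\operatorname{var}[X] = \mathscr{E}[X^2] - (\mathscr{E}[X])^2 = m_X''(0) - \left(\frac{r}{\lambda}\right)^2 = \frac{r(r + 1)}{\lambda^2} - \left(\frac{r}{\lambda}\right)^2 = \frac{r}{\lambda^2}.$$

////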


The exponential distribution has been used as a model for lifetimes of
various things. When we introduced the Poisson distribution, we spoke of
certain happenings, for example, particle emissions, occurring in time. The length
of the time interval between successive happenings can be shown to have an
exponential distribution provided that the number of happenings in a fixed
time interval has a Poisson distribution. We comment on this again in Subsec.
4.2 below. Also, if we assume again that the number of happenings in a fixed
time interval is Poisson distributed, the length of time between time 0 and the
instant when the rth happening occurs can be shown to have a gamma
distribution. So a gamma random variable can be thought of as a continuous
waiting-time random variable. It is the time one has to wait for the rth happening.
Recall that the geometric and negative binomial random variables were
discrete waiting-time random variables. In a sense, they are discrete analogs of the
negative exponential and gamma distributions, respectively.


Theorem 17 If the random variable X has a gamma distribution with
parameters r and $\lambda$, where r is a positive integer, then

$$F_X(x) = 1 - \sum_{j=0}^{r-1} \frac{e^{-\lambda x}(\lambda x)^j}{j!}. \tag{33}$$

PROOF The proof can be obtained by successive integrations by
parts. ////

For $\lambda = 1$, $F_X(x)$ given in Eq. (33) is called the incomplete gamma function
and has been extensively tabulated.
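
Theorem 17 can be spot-checked by integrating the gamma density numerically and comparing with the finite sum in Eq. (33). The code below is a sketch under assumptions: the function names and the midpoint-rule step count are illustrative choices, and $\Gamma(r) = (r - 1)!$ is used because r is a positive integer.

```python
from math import exp, factorial


def gamma_density(x: float, r: int, lam: float) -> float:
    """Density of Eq. (30) for integer r, where Gamma(r) = (r - 1)!."""
    if x < 0:
        return 0.0
    return lam * (lam * x) ** (r - 1) * exp(-lam * x) / factorial(r - 1)


def gamma_cdf_closed_form(x: float, r: int, lam: float) -> float:
    """Right-hand side of Eq. (33) for x >= 0."""
    if x < 0:
        return 0.0
    return 1.0 - sum(exp(-lam * x) * (lam * x) ** j / factorial(j) for j in range(r))


def gamma_cdf_numeric(x: float, r: int, lam: float, steps: int = 20000) -> float:
    """Midpoint-rule approximation to the integral of the density over [0, x]."""
    h = x / steps
    return h * sum(gamma_density((k + 0.5) * h, r, lam) for k in range(steps))
```

The two cumulative distribution functions should agree to several decimal places, and with r = 1 the closed form reduces to the exponential value $1 - e^{-\lambda x}$.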


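Theorem 18 If the random variable X has an exponential distribution
with parameter $\lambda$, then

$$P[X > a + b \mid X > a] = P[X > b] \quad \text{for } a > 0 \text{ and } b > 0.$$

PROOF

$$P[X > a + b \mid X > a] = \frac{P[X > a + b]}{P[X > a]} = \frac{e^{-\lambda(a + b)}}{e^{-\lambda a}} = e^{-\lambda b} = P[X > b].$$

////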


Let X represent the lifetime of a given component; then, in words,
Theorem 18 states that the conditional probability that the component will last
a + b time units given that it has lasted a time units is the same as its initial
probability of lasting b time units. Another way of saying this is to say that an
"old" functioning component has the same lifetime distribution as a "new"
functioning component or that the component is not subject to fatigue or to
wear.


3.4 Beta Distribution

A family of probability densities of continuous random variables taking on values 
in the interval (0, 1) is the family of beta distributions. 

Definition 14 Beta distribution If a random variable X has a density
given by

$$f_X(x) = f_X(x; a, b) = \frac{1}{B(a, b)} x^{a-1}(1 - x)^{b-1} I_{(0,1)}(x), \tag{34}$$

where $a > 0$ and $b > 0$, then X is defined to have a beta distribution. ////

The function $B(a, b) = \int_0^1 x^{a-1}(1 - x)^{b-1}\,dx$, called the beta function, is
mentioned briefly in Appendix A.

Remark The beta distribution reduces to the uniform distribution over
(0, 1) if a = b = 1. ////

Remark The cumulative distribution function of a beta-distributed
random variable is

$$F_X(x; a, b) = I_{(0,1)}(x) \int_0^x \frac{1}{B(a, b)} u^{a-1}(1 - u)^{b-1}\,du + I_{[1,\infty)}(x); \tag{35}$$

it is often called the incomplete beta and has been extensively tabulated.
////

The moment generating function for the beta distribution does not have a
simple form; however, the moments are readily found by using their definition.

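Theorem 19 If X is a beta-distributed random variable, then

$$\mathscr{E}[X] = \frac{a}{a + b} \quad \text{and} \quad \operatorname{var}[X] = \frac{ab}{(a + b + 1)(a + b)^2}.$$

PROOF

$$\mathscr{E}[X^k] = \frac{1}{B(a, b)} \int_0^1 x^{k+a-1}(1 - x)^{b-1}\,dx = \frac{B(k + a, b)}{B(a, b)} = \frac{\Gamma(k + a)\Gamma(b)}{\Gamma(k + a + b)} \cdot \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} = \frac{\Gamma(k + a)\Gamma(a + b)}{\Gamma(a)\Gamma(k + a + b)};$$

hence,

$$\mathscr{E}[X] = \frac{\Gamma(a + 1)\Gamma(a + b)}{\Gamma(a)\Gamma(a + b + 1)} = \frac{a}{a + b},$$

and

$$\operatorname{var}[X] = \mathscr{E}[X^2] - (\mathscr{E}[X])^2 = \frac{\Gamma(a + 2)\Gamma(a + b)}{\Gamma(a)\Gamma(a + b + 2)} - \left(\frac{a}{a + b}\right)^2 = \frac{(a + 1)a}{(a + b + 1)(a + b)} - \left(\frac{a}{a + b}\right)^2 = \frac{ab}{(a + b + 1)(a + b)^2}.$$

////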
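
The general moment formula obtained in the proof of Theorem 19 translates directly into code. This is a small illustrative sketch, with hypothetical function names, that evaluates $\mathscr{E}[X^k]$ through log-gamma values (to avoid overflow for large arguments) and then forms the variance as $\mathscr{E}[X^2] - (\mathscr{E}[X])^2$.

```python
from math import exp, lgamma


def beta_moment(k: int, a: float, b: float) -> float:
    """E[X^k] = Gamma(k+a) Gamma(a+b) / (Gamma(a) Gamma(k+a+b)) for a beta(a, b) variable."""
    return exp(lgamma(k + a) + lgamma(a + b) - lgamma(a) - lgamma(k + a + b))


def beta_mean_and_variance(a: float, b: float):
    """Mean and variance built from the first two moments."""
    mean = beta_moment(1, a, b)
    return mean, beta_moment(2, a, b) - mean ** 2
```

The returned pair can be compared with the closed forms $a/(a + b)$ and $ab/[(a + b + 1)(a + b)^2]$ stated in the theorem.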


The family of beta densities is a two-parameter family of densities that
is positive on the interval (0, 1) and can assume quite a variety of different
shapes, and consequently the beta distribution can be used to model an
experiment for which one of the shapes is appropriate.


3.5 Other Continuous Distributions

In this subsection other parametric families of probability density functions that
will appear later in this book are briefly introduced; many other families exist.
The introductions of the three families of distributions that go by the names of
Student's t distribution, chi-square distribution, and F distribution are
deferred until Chap. VI. These three families, as we shall see, are very important
when sampling from normal distributions.

Cauchy distribution A distribution which we shall find useful for illustrative
purposes is the Cauchy, which has the density

$$f_X(x; \alpha, \beta) = \frac{1}{\pi\beta\{1 + [(x - \alpha)/\beta]^2\}}, \tag{36}$$

where $-\infty < \alpha < \infty$ and $\beta > 0$.

Although the Cauchy density is symmetrical about the parameter $\alpha$, its
mean and higher moments do not exist. The cumulative distribution function
is

$$F_X(x) = \int_{-\infty}^{x} \frac{du}{\pi\beta\{1 + [(u - \alpha)/\beta]^2\}} = \frac{1}{2} + \frac{1}{\pi} \arctan\frac{x - \alpha}{\beta}. \tag{37}$$


Lognormal distribution Let X be a positive random variable, and let a new
random variable Y be defined as $Y = \log_e X$. If Y has a normal distribution,
then X is said to have a lognormal distribution. The density of a lognormal
distribution is given by

$$f_X(x; \mu, \sigma^2) = \frac{1}{x\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2\sigma^2}(\log_e x - \mu)^2\right] I_{(0,\infty)}(x), \tag{38}$$

where $-\infty < \mu < \infty$ and $\sigma > 0$.

$$\mathscr{E}[X] = e^{\mu + \sigma^2/2} \quad \text{and} \quad \operatorname{var}[X] = e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2} \tag{39}$$

for a lognormal random variable X. Also, if X has a lognormal distribution,
then $\mathscr{E}[\log_e X] = \mu$ and $\operatorname{var}[\log_e X] = \sigma^2$.


Double exponential or Laplace distribution A random variable X is said to
have a double exponential, or Laplace, distribution if the density function of X
is given by

$$f_X(x) = f_X(x; \alpha, \beta) = \frac{1}{2\beta} \exp\left(-\frac{|x - \alpha|}{\beta}\right), \tag{40}$$

where $-\infty < \alpha < \infty$ and $\beta > 0$. If X has a Laplace distribution, then

$$\mathscr{E}[X] = \alpha \quad \text{and} \quad \operatorname{var}[X] = 2\beta^2. \tag{41}$$

Weibull distribution The density

$$f(x; a, b) = abx^{b-1} e^{-ax^b} I_{(0,\infty)}(x), \tag{42}$$

where $a > 0$ and $b > 0$, is called the Weibull density, a distribution that has been
successfully used in reliability theory. For b = 1, the Weibull density reduces
to the exponential density. It has mean $(1/a)^{1/b}\,\Gamma(1 + b^{-1})$ and variance
$(1/a)^{2/b}\,[\Gamma(1 + 2b^{-1}) - \Gamma^2(1 + b^{-1})]$.

Logistic distribution The logistic distribution is given in cumulative
distribution form by

$$F(x; \alpha, \beta) = \frac{1}{1 + e^{-(x - \alpha)/\beta}}, \tag{43}$$

where $-\infty < \alpha < \infty$ and $\beta > 0$. The mean of the logistic distribution is given
by $\alpha$. The variance is given by $\beta^2\pi^2/3$. Note that $F(\alpha - d; \alpha, \beta) =
1 - F(\alpha + d; \alpha, \beta)$, and so the density of the logistic is symmetrical about $\alpha$.
This distribution has been used to model tolerance levels in bioassay problems.


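Pareto distribution The Pareto distribution is given in density function form
by

$$f(x; x_0, \theta) = \frac{\theta x_0^{\theta}}{x^{\theta + 1}} I_{(x_0,\infty)}(x), \tag{44}$$

where $\theta > 0$ and $x_0 > 0$. The mean and variance, respectively, of the Pareto
distribution are given by

$$\frac{\theta x_0}{\theta - 1} \quad \text{for } \theta > 1 \qquad \text{and} \qquad \frac{\theta x_0^2}{\theta - 2} - \left(\frac{\theta x_0}{\theta - 1}\right)^2 \quad \text{for } \theta > 2.$$

This distribution has found application in modeling problems involving
distributions of incomes when incomes exceed a certain limit $x_0$.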


Gumbel distribution The cumulative distribution function

$$F(x; \alpha, \beta) = \exp\left(-e^{-(x - \alpha)/\beta}\right), \tag{45}$$

where $-\infty < \alpha < \infty$ and $\beta > 0$, is called the Gumbel distribution. It appears
as a limiting distribution in the theory of extreme-value statistics.


Pearsonian system of distributions Consider a density function $f_X(x)$
which satisfies the differential equation

$$\frac{1}{f_X(x)} \frac{df_X(x)}{dx} = \frac{x + a}{b_0 + b_1 x + b_2 x^2} \tag{46}$$

for constants $a$, $b_0$, $b_1$, and $b_2$. Such a density is said to belong to the Pearsonian
system of density functions. Many of the probability density functions that we
have considered are special cases of the Pearsonian system. For example, if

$$f_X(x) = \frac{\lambda^r}{\Gamma(r)} x^{r-1} e^{-\lambda x} I_{[0,\infty)}(x),$$

then

$$\frac{1}{f_X(x)} \frac{df_X(x)}{dx} = \frac{r - 1}{x} - \lambda = \frac{x - (r - 1)/\lambda}{-x/\lambda}$$

for $x > 0$; so the gamma distribution is a member of the Pearsonian system with
$a = -(r - 1)/\lambda$, $b_1 = -1/\lambda$, and $b_0 = b_2 = 0$.

4 COMMENTS 

We conclude this chapter by making several comments that tie together some 
of the density functions defined in Secs. 2 and 3 of this chapter. 

4.1 Approximations 

Although many approximations of one distribution by another exist, we will 
give only three here. Others will be given along with the central-limit theorem 
in Chaps. V and VI. 

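Binomial by Poisson We defined the binomial discrete density function,
with parameters n and p, as

$$\binom{n}{x} p^x (1 - p)^{n - x} \quad \text{for } x = 0, 1, \ldots, n.$$

If the parameter n approaches infinity and p approaches 0 in such a way that
np remains constant, say equal to $\lambda$, then

$$\binom{n}{x} p^x (1 - p)^{n - x} \to \frac{e^{-\lambda}\lambda^x}{x!} \tag{47}$$

for fixed integer x. The above follows immediately from the following
consideration:

$$\binom{n}{x} p^x (1 - p)^{n - x} = \frac{n(n - 1)\cdots(n - x + 1)}{x!} \left(\frac{\lambda}{n}\right)^x \left(1 - \frac{\lambda}{n}\right)^{n - x} = \frac{\lambda^x}{x!} \cdot \frac{n(n - 1)\cdots(n - x + 1)}{n^x} \left(1 - \frac{\lambda}{n}\right)^{n} \left(1 - \frac{\lambda}{n}\right)^{-x} \to \frac{e^{-\lambda}\lambda^x}{x!},$$

since

$$\left(1 - \frac{\lambda}{n}\right)^{-x} \to 1, \quad \frac{n(n - 1)\cdots(n - x + 1)}{n^x} \to 1, \quad \text{and} \quad \left(1 - \frac{\lambda}{n}\right)^{n} \to e^{-\lambda} \quad \text{as } n \to \infty.$$

Thus, for large n and small p, the binomial probability $\binom{n}{x} p^x (1 - p)^{n - x}$
can be approximated by the Poisson probability $e^{-np}(np)^x/x!$. The utility of
this approximation is evident if one notes that the binomial probability
involves two parameters and the Poisson only one.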
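
The quality of the approximation in Eq. (47) can be examined numerically by holding np fixed and letting n grow. The following sketch is illustrative only; the function names and the range of x values compared (`x_max`) are assumptions made here.

```python
from math import comb, exp, factorial


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Binomial probability C(n, x) p^x (1-p)^(n-x)."""
    return comb(n, x) * p ** x * (1.0 - p) ** (n - x)


def poisson_pmf(x: int, lam: float) -> float:
    """Poisson probability e^{-lam} lam^x / x!."""
    return exp(-lam) * lam ** x / factorial(x)


def max_abs_error(n: int, lam: float, x_max: int = 20) -> float:
    """Largest gap between the two probabilities over x = 0, ..., x_max, with p = lam / n."""
    p = lam / n
    return max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(x_max + 1))
```

Evaluating `max_abs_error` for increasing n with the same `lam` (for instance n = 20, 100, 10000 with `lam = 2.0`) should show the discrepancy shrinking, in line with the limit argument above. Note that `x_max` must not exceed n.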

Binomial and Poisson by normal


Theorem 20 Let random variable X have a Poisson distribution with
parameter $\lambda$; then for fixed $a < b$

$$P\left[a < \frac{X - \lambda}{\sqrt{\lambda}} < b\right] = P[\lambda + a\sqrt{\lambda} < X < \lambda + b\sqrt{\lambda}] \to \Phi(b) - \Phi(a) \quad \text{as } \lambda \to \infty. \tag{48}$$

PROOF Omitted. [Eq. (48) can be proved using Stirling's formula,
which is given in Appendix A. It also follows from the central-limit
theorem.] ////


Theorem 21 De Moivre-Laplace limit theorem Let a random variable
X have a binomial distribution with parameters n and p; then for fixed
$a < b$

$$P\left[a \le \frac{X - np}{\sqrt{npq}} \le b\right] = P[np + a\sqrt{npq} \le X \le np + b\sqrt{npq}] \to \Phi(b) - \Phi(a) \quad \text{as } n \to \infty. \tag{49}$$

PROOF Omitted. (This is a special case of the central-limit
theorem given in Chaps. V and VI.) ////

Remark We approximated the binomial distribution with a Poisson
distribution in Eq. (47) for large n and small p. Theorem 21 gives a
normal approximation of the binomial distribution for large n. ////


The usefulness of Theorems 20 and 21 rests in the approximations that
they give. For instance, Eq. (49) states that $P[np + a\sqrt{npq} \le X \le np + b\sqrt{npq}]$
is approximately equal to $\Phi(b) - \Phi(a)$ for large n. Or if $c = np + a\sqrt{npq}$ and
$d = np + b\sqrt{npq}$, then Eq. (49) gives that $P[c \le X \le d]$ is approximately equal
to

$$\Phi\left(\frac{d - np}{\sqrt{npq}}\right) - \Phi\left(\frac{c - np}{\sqrt{npq}}\right)$$

for large n, and, so, an approximate value for the probability that a binomial
random variable falls in an interval can be obtained from the standard normal
distribution. Note that the binomial distribution is discrete and the approximating
normal distribution is continuous.


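EXAMPLE 15 Suppose that two fair dice are tossed 600 times. Let X
denote the number of times a total of 7 occurs. Then X has a binomial
distribution with parameters n = 600 and $p = \frac{1}{6}$. $\mathscr{E}[X] = 100$. Find
$P[90 \le X \le 110]$.

$$P[90 \le X \le 110] = \sum_{j=90}^{110} \binom{600}{j} \left(\frac{1}{6}\right)^j \left(\frac{5}{6}\right)^{600 - j},$$

a sum that is tedious to evaluate. Using the approximation given by
Eq. (49), we have

$$P[90 \le X \le 110] \approx \Phi\left(\frac{110 - 100}{\sqrt{500/6}}\right) - \Phi\left(\frac{90 - 100}{\sqrt{500/6}}\right) = \Phi\left(\sqrt{\tfrac{6}{5}}\right) - \Phi\left(-\sqrt{\tfrac{6}{5}}\right) \approx \Phi(1.095) - \Phi(-1.095) \approx .726.$$

////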
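
The sum in Example 15 is tedious by hand but straightforward by machine, so the normal approximation of Eq. (49) can be set beside the exact binomial value. This is a sketch under assumptions: the function names are hypothetical, $\Phi$ is computed from the error function rather than from the table, and no continuity correction is applied, matching the example.

```python
from math import comb, erf, sqrt


def binomial_interval_prob(lo: int, hi: int, n: int, p: float) -> float:
    """Exact P[lo <= X <= hi] for a binomial(n, p) random variable."""
    return sum(comb(n, x) * p ** x * (1.0 - p) ** (n - x) for x in range(lo, hi + 1))


def standard_normal_cdf(z: float) -> float:
    """Phi(z) via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def normal_approx_interval_prob(lo: int, hi: int, n: int, p: float) -> float:
    """Approximation of Eq. (49): Phi((hi - np)/sqrt(npq)) - Phi((lo - np)/sqrt(npq))."""
    mean = n * p
    sd = sqrt(n * p * (1.0 - p))
    return standard_normal_cdf((hi - mean) / sd) - standard_normal_cdf((lo - mean) / sd)
```

Calling both functions with `(90, 110, 600, 1/6)` gives the exact and approximate values side by side; since the binomial is discrete and the normal is continuous, some discrepancy between the two is to be expected.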


4.2 Poisson and Exponential Relationship

When the Poisson distribution was introduced in Subsec. 2.4, an experiment 
consisting of the counting of the number of happenings of a certain phenomenon 
in time was given special consideration. We argued that under certain conditions 
the count of the number of happenings in a fixed time interval was Poisson dis- 
tributed with parameter, the mean, proportional to the length of the interval. 
Suppose now that one of these happenings has just occurred; what then is the
distribution of the length of time, say X, that one will have to wait until the
next happening? $P[X > t] = P[\text{no happenings in time interval of length } t] = e^{-\nu t}$,
where $\nu$ is the mean occurrence rate; so

$$F_X(t) = P[X \le t] = 1 - P[X > t] = 1 - e^{-\nu t} \quad \text{for } t > 0;$$

that is, X has an exponential distribution. On the other hand, it can be proved,
under an independence assumption, that if the happenings are occurring in
time in such a way that the distribution of the lengths of time between successive
happenings is exponential, then the distribution of the number of happenings
in a fixed time interval is Poisson distributed. Thus the exponential and Poisson
distributions are related.
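
The second direction of this relationship can be illustrated by simulation: generate independent exponential gaps, count how many happenings fall in a fixed interval, and compare the sample mean and variance of the counts with the Poisson value $\nu t$. The sketch below is only an illustration; the function names, the seed, and the number of trials are assumptions, and a simulation supports the claim without proving it.

```python
import random


def count_happenings(rate: float, horizon: float, rng: random.Random) -> int:
    """Number of happenings in [0, horizon] when successive gaps are exponential(rate)."""
    count = 0
    t = rng.expovariate(rate)  # time of the first happening
    while t <= horizon:
        count += 1
        t += rng.expovariate(rate)  # add an independent exponential gap
    return count


def simulate_counts(rate: float, horizon: float, trials: int, seed: int = 0) -> list:
    """Repeat the counting experiment `trials` times with a seeded generator."""
    rng = random.Random(seed)
    return [count_happenings(rate, horizon, rng) for _ in range(trials)]


def sample_mean_and_variance(values: list):
    """Sample mean and (population-style) sample variance."""
    n = len(values)
    mean = sum(values) / n
    return mean, sum((v - mean) ** 2 for v in values) / n
```

With `rate = 2.0` and `horizon = 3.0`, both the sample mean and the sample variance of the counts should come out near 6, reflecting the equality of mean and variance for a Poisson distribution.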


4.3 Contagious Distributions and Truncated Distributions 

A brief introduction to the concept of contagious distributions is given here.
If $f_0(\cdot), f_1(\cdot), \ldots, f_n(\cdot), \ldots$ is a sequence of density functions which are either all
discrete density functions or all probability density functions, which may or may
not depend on parameters, and $p_0, p_1, \ldots, p_n, \ldots$ is a sequence of parameters
satisfying $p_i \ge 0$ and $\sum_{i=0}^{\infty} p_i = 1$, then $\sum_{i=0}^{\infty} p_i f_i(x)$ is a density function, which is
sometimes called a contagious distribution or a mixture. For example, if
$f_0(x) = \phi_{\mu_0,\sigma_0^2}(x)$ (a normal with mean $\mu_0$ and variance $\sigma_0^2$) and $f_1(x) = \phi_{\mu_1,\sigma_1^2}(x)$,
then

$$p_0\,\phi_{\mu_0,\sigma_0^2}(x) + p_1\,\phi_{\mu_1,\sigma_1^2}(x) = (1 - p)\frac{1}{\sqrt{2\pi}\,\sigma_0} e^{-(x - \mu_0)^2/2\sigma_0^2} + p\,\frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-(x - \mu_1)^2/2\sigma_1^2}, \tag{50}$$

where $p_1 = p$ and $p_0 = 1 - p$, is a mixture of two normal densities. Equation
(50) is also sometimes referred to as a contaminated normal. A random variable
X has distribution given by Eq. (50) if it is normally distributed with mean $\mu_1$ and
variance $\sigma_1^2$ with probability p and normally distributed with mean $\mu_0$ and
variance $\sigma_0^2$ with probability $1 - p$. Contagious distributions or mixtures can be
useful models for certain experiments. For instance, the mixture of two normal
distributions given in Eq. (50) has five parameters, namely, $p$, $\mu_0$, $\mu_1$, $\sigma_0$, and
$\sigma_1$; if we vary these five parameters, the density can be forced to assume a
variety of different shapes, some of which are bimodal, that is, the density has
two distinct local maximums.

Physical considerations of the random experiment at hand can sometimes
persuade one to consider modeling the experiment with a mixture. The
experimenter may know that the phenomena that he is observing are a mixture;
for example, the radioactive particle emissions under observation might be a
mixture of the particle emissions of two, or several, different types of radioactive
materials.
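
The effect of the parameters on the shape of a two-component normal mixture can be explored numerically. The sketch below is illustrative only: it evaluates a mixture with weight 1 − p on a normal with mean `mean0` and weight p on a normal with mean `mean1`, and counts local maximums on a grid. The function names, the grid limits, and the grid size are assumptions made for this example.

```python
import math

def normal_pdf(x, mean, sd):
    """Normal density with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sd)

def mixture_pdf(x, p, mean0, sd0, mean1, sd1):
    """Two-normal mixture: weight 1 - p on component 0, weight p on component 1."""
    return (1.0 - p) * normal_pdf(x, mean0, sd0) + p * normal_pdf(x, mean1, sd1)

def count_local_maxima(p, mean0, sd0, mean1, sd1, lo=-10.0, hi=10.0, steps=4000):
    """Count strict interior local maximums of the mixture density on an even grid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [mixture_pdf(x, p, mean0, sd0, mean1, sd1) for x in xs]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])
```

With equal weights and unit standard deviations, well-separated means such as −2 and 2 give a bimodal density, while close means such as 0 and 0.5 give a single maximum.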
The concept of mixing can be extended. Let $\{f(x; \theta)\}$ be a family of
density functions parameterized or indexed by $\theta$. Let the totality of values
that the parameter $\theta$ can assume be denoted by $\Theta$. If $\Theta$ is an interval (possibly
infinite) and $g(\theta)$ is a probability density function which is 0 for all arguments
not in $\Theta$, then

\[ \int_{\Theta} f(x; \theta)\, g(\theta)\, d\theta \tag{51} \]

is again a density function, called a contagious distribution or a mixture. For
example, suppose $f(x; \theta) = e^{-\theta}\theta^{x}/x!$ for x = 0, 1, 2, ... and $f(x; \theta) = 0$ otherwise,
and

\[ g(\theta) = \frac{\lambda^{r}}{\Gamma(r)}\, \theta^{r-1} e^{-\lambda\theta} I_{(0,\infty)}(\theta), \]

a gamma density. Then

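\begin{align*}
\int_0^{\infty} f(x; \theta)\, g(\theta)\, d\theta
  &= \int_0^{\infty} \frac{e^{-\theta}\theta^{x}}{x!} \cdot \frac{\lambda^{r}}{\Gamma(r)}\, \theta^{r-1} e^{-\lambda\theta}\, d\theta \\
  &= \frac{\lambda^{r}}{x!\,\Gamma(r)} \int_0^{\infty} \theta^{r+x-1} e^{-(\lambda+1)\theta}\, d\theta \\
  &= \frac{\lambda^{r}}{x!\,\Gamma(r)} \cdot \frac{1}{(\lambda+1)^{r+x}} \int_0^{\infty} [(\lambda+1)\theta]^{r+x-1} e^{-(\lambda+1)\theta}\, d[(\lambda+1)\theta] \\
  &= \frac{\lambda^{r}}{x!\,\Gamma(r)} \cdot \frac{\Gamma(r+x)}{(\lambda+1)^{r+x}} \\
  &= \left(\frac{\lambda}{\lambda+1}\right)^{r} \frac{\Gamma(r+x)}{(x!)\,\Gamma(r)} \left(\frac{1}{\lambda+1}\right)^{x},
\end{align*}
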

which is the density function of a negative binomial distribution with
parameters r and $p = \lambda/(\lambda + 1)$. We say that the derived negative binomial
distribution is the gamma mixture of Poissons.

\[ \int_0^{\infty} \frac{e^{-\theta}\theta^{x}}{x!}\, g(\theta)\, d\theta \]

is sometimes called a compound Poisson, where $g(\theta) I_{(0,\infty)}(\theta)$ is a probability
density function.
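
The gamma mixture of Poissons can be verified numerically. The sketch below is an illustration under stated assumptions: it approximates the mixing integral by a midpoint rule on a truncated range and compares the result with the negative binomial density with parameters r and p = λ/(λ + 1). The truncation point `upper`, the step count, and all function names are choices made for this example, and the rule as written is only appropriate when r ≥ 1 so that the integrand is bounded near 0.

```python
import math

def poisson_pmf(x, theta):
    """Poisson density with mean theta, evaluated at x."""
    return math.exp(-theta) * theta ** x / math.factorial(x)

def gamma_pdf(theta, r, lam):
    """Gamma density with parameters r and lam, evaluated at theta > 0."""
    return lam ** r / math.gamma(r) * theta ** (r - 1) * math.exp(-lam * theta)

def mixed_pmf(x, r, lam, upper=60.0, steps=60000):
    """Midpoint-rule approximation of the integral of f(x; theta) g(theta) over (0, upper)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h
        total += poisson_pmf(x, theta) * gamma_pdf(theta, r, lam)
    return total * h

def negative_binomial_pmf(x, r, p):
    """Negative binomial density Gamma(r + x) / (x! Gamma(r)) * p^r * (1 - p)^x."""
    return math.gamma(r + x) / (math.factorial(x) * math.gamma(r)) * p ** r * (1.0 - p) ** x
```

For r = 3 and λ = 2, `mixed_pmf(x, 3.0, 2.0)` and `negative_binomial_pmf(x, 3.0, 2.0 / 3.0)` should agree to several decimal places for each small x.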

We have sketchily illustrated above how new parametric families of den- 
sities can be obtained from existing families by the technique of mixing. In 
Subsec. 2.6 we indicated how truncation could be employed to generate new 
families of discrete densities. Truncation can also be utilized to form other 
families of continuous distributions. For instance, the family of beta distri- 
butions provides densities that are useful in modeling an experiment for which 
it is known that the values that the random variable can assume are between 0
and 1. A truncated normal or gamma distribution would also provide a useful
model for such an experiment. A normal distribution that is truncated at 0
on the left and at 1 on the right is defined in density form as

\[ f(x) = f(x; \mu, \sigma) = \frac{\phi_{\mu,\sigma^2}(x)\, I_{(0,1)}(x)}{\Phi_{\mu,\sigma^2}(1) - \Phi_{\mu,\sigma^2}(0)}. \tag{52} \]

This truncated normal distribution, like the beta distribution, assumes values
between 0 and 1.

Truncation can be defined in general. If X is a random variable with
density $f_X(\cdot)$ and cumulative distribution $F_X(\cdot)$, then the density of X truncated
on the left at a and on the right at b is given by

\[ \frac{f_X(x)\, I_{(a,b)}(x)}{F_X(b) - F_X(a)}. \tag{53} \]
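
The general truncation formula can be tried out on the normal case. This is a minimal sketch, assuming the normal cumulative distribution function is computed from `math.erf`; the helper `integrate` is a simple midpoint rule introduced only for this illustration, and all names are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Normal density with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sd)

def normal_cdf(x, mean, sd):
    """Normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def truncated_normal_pdf(x, mean, sd, a, b):
    """Truncated density f_X(x) I_(a,b)(x) / [F_X(b) - F_X(a)] with f_X, F_X normal."""
    if not (a < x < b):
        return 0.0
    return normal_pdf(x, mean, sd) / (normal_cdf(b, mean, sd) - normal_cdf(a, mean, sd))

def integrate(func, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of func over (lo, hi)."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (i + 0.5) * h) for i in range(steps))
```

Integrating `truncated_normal_pdf` over (a, b) should return approximately 1, and for symmetric truncation points a = μ − c and b = μ + c the mean of the truncated density should be approximately μ.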


PROBLEMS 

1 (a) Let X be a random variable having a binomial distribution with parameters
n = 25 and p = .2. Evaluate $P[X < \mu_X - 2\sigma_X]$.

(b) If X is a random variable with Poisson distribution satisfying P[X = 0] =
P[X = 1], what is $\mathscr{E}[X]$?

(c) If X is uniformly distributed over (1, 2) find z such that P[ X > z + «* ! 

(d) If X is normally distributed with mean 2 and variance 1, find $P[|X - 2| < 1]$.

(e) Suppose X is binomially distributed with parameters n and p; further
suppose that $\mathscr{E}[X] = 5$ and var[X] = 4. Find n and p.

(f) If $\mathscr{E}[X] = 10$ and $\sigma_X = 3$, can X have a negative binomial distribution?

(g) If X has a negative exponential distribution with mean 2, find $P[X < 1 \mid X < 2]$.
(A) Name three distributions for which P { X ^ = | 

(i) Let X be a random variable having binomial distribution with parameters
n = 100 and p = .1. Evaluate $P[X \le \mu_X - 3\sigma_X]$.

0) If X has a Poisson distribution and P[X — 0] = }, what is £fA(]? 

(k) Suppose X has a binomial distribution with parameters n and p. For what
p is var[X] maximized if we assume n is fixed?

(l) Suppose X has a negative exponential distribution with parameter $\lambda$. If
$P[X \le 1] = P[X > 1]$, what is var[X]?

(m) Suppose X is a continuous random variab/e with uniform distribution 
having mean 1 and variance J What is P[X < OJ? 

(n) If X has a beta distribution, can $\mathscr{E}[1/X]$ be unity?

(o) Can X ever have the same distribution as −X? If so, when?

(p) If X is a random variable having moment generating function $\exp(e^{t} - 1)$,
what is $\mathscr{E}[X]$?

2 (a) Find the mode of the beta distribution.
(b) Find the mode of the gamma distribution.

3 Name a parametric family of distributions which satisfies:

(a) The mean must be greater than or equal to the variance.

(b) The mean must be equal to the variance. 

(c) The mean must be less than or equal to the variance. 

(d) The mean can be less than, equal to, or greater than the variance (for
different parameter values).

4 (a) If X is normally distributed with mean 2 and variance 2, express
$P[|X - 1| < 2]$ in terms of the standard normal cumulative distribution
function.

(b) If X is normally distributed with mean $\mu > 0$ and variance $\sigma^2 = \mu^2$, express
$P[X < -\mu \mid X < \mu]$ in terms of the standard normal cumulative distribution
function.

(c) Let X be normally distributed with mean $\mu$ and variance $\sigma^2$. Suppose $\sigma^2$ is
some function of $\mu$, say $\sigma^2 = h(\mu)$. Pick $h(\cdot)$ so that $P[X \le 0]$ does not
depend on $\mu$ for $\mu > 0$.

5 Use the alternate definition of the median as given in the remark following
Definition 18 of Chap. II. Find the median in each of the following cases:

(a) $f_X(x) = \lambda e^{-\lambda x} I_{(0,\infty)}(x)$.

(b) X is uniformly distributed on the interval $(\theta_1, \theta_2)$.

(c) X has a binomial distribution with n = 4, p = .5.

(d) X has a binomial distribution with n = 5, p = .5.

(e) X has a binomial distribution with n = 2, p = .9.

6 A contractor has found through experience that the low bid for a job (excluding
his own bid) is a random variable that is uniformly distributed over the interval 
(JC, 2C), where C is the contractor’s cost estimate (no profit or loss) of the job. 
If profit is defined as 0 if the contractor does not get the job (his bid is greater than 
the low bid) and as the difference between his bid and his cost estimate C if he gets 
the job, what should he bid (in terms of C) in order to maximize his expected 
profit? 

7 A merchant has found that the number of items of brand XYZ that he can sell
in a day is a Poisson random variable with mean 4.

(a) How many items of brand XYZ should the merchant stock to be 95 percent
certain that he will have enough to last for 25 days? (Give a numerical
answer.)

(b) What is the expected number of days out of 25 that the merchant will sell
no items of brand XYZ?

8 (a) If X is binomially distributed with parameters n and p, what is the distribution
of Y = n − X?

(b) Two dice are thrown n times. Let X denote the number of throws in which the
number on the first die exceeds the number on the second die. What is the
distribution of X?

*(c) A drunk performs a "random walk" over positions 0, ±1, ±2, ... as follows:
He starts at 0. He takes successive one-unit steps, going to the right with
probability p and to the left with probability 1 − p. His steps are
independent. Let X denote his position after n steps. Find the distribution of
(X + n)/2, and then find $\mathscr{E}[X]$.

*(d) Let $X_1$ ($X_2$) have a binomial distribution with parameters n and $p_1$ (n and $p_2$).
If $p_1 < p_2$, show that $P[X_1 \le k] \ge P[X_2 \le k]$ for k = 0, 1, ..., n. (This
result says that the smaller the p, the more the binomial distribution is shifted
to the left.)

9 In a town with 5000 adults, a sample of 100 is asked their opinion of a proposed
municipal project; 60 are found to favor it, and 40 oppose it. If, in fact, the
adults of the town were equally divided on the proposal, what would be the
probability of obtaining a majority of 60 or more favoring it in a sample of 100?

10 A distributor of bean seeds determines from extensive tests that 5 percent of a large
batch of seeds will not germinate. He sells the seeds in packages of 200 and
guarantees 90 percent germination. What is the probability that a given package
will violate the guarantee?

*11 (a) A manufacturing process is intended to produce electrical fuses with no
more than 1 percent defective. It is checked every hour by trying 10 fuses
selected at random from the hour's production. If 1 or more of the 10
fail, the process is halted and carefully examined. If, in fact, its
probability of producing a defective fuse is .01, what is the probability that the
process will needlessly be examined in a given instance?

(b) Referring to part (a), how many fuses (instead of 10) should be tested if the
manufacturer desires that the probability be about .95 that the process will
be examined when it is producing 10 percent defectives?

12 An insurance company finds that .005 percent of the population die from a certain
kind of accident each year. What is the probability that the company must pay
off on more than 3 of 10,000 insured risks against such accidents in a given
year?

13 (a) If X has a Poisson distribution with P[X = 1] = P[X = 2], what is
P[X = 1 or 2]?

(b) If X has a Poisson distribution with mean 1, show that $\mathscr{E}[|X - 1|] = 2\sigma_X/e$.

*14 Recall Theorems 4 and 8. Formulate, and then prove or disprove, a similar
theorem for the negative binomial distribution.

*15 Let X be normally distributed with mean $\mu$ and variance $\sigma^2$. Truncate the density
of X on the left at a and on the right at b, and then calculate the mean of the
truncated distribution. (Note that the mean of the truncated distribution should fall
between a and b. Furthermore, if $a = \mu - c$ and $b = \mu + c$, then the mean of the
truncated distribution should equal $\mu$.)

*16 Show that the hypergeometric distribution can be approximated by the binomial
distribution for large M and K; i.e., show that
17 Let X be the life in hours of a radio tube. Assume that X is normally distributed
with mean 20 and variance $\sigma^2$. If a purchaser of such radio tubes requires that
at least 90 percent of the tubes have lives exceeding 150 hours, what is the largest
value $\sigma$ can be and still have the purchaser satisfied?

18 Assume that the number of fatal car accidents in a certain state obeys a Poisson 
distribution with an average of one per day. 

(a) What is the probability of more than ten such accidents in a week? 

(b) What is the probability that more than 2 days will lapse between two such 
accidents? 

19 The distribution given by 

f(x; xe’* x «' 2 l i0 . *,Cv) for P > 0 

is called the Rayleigh distribution. 

(a) Show that the mean and variance exist, and find them. 

( b ) Does the Rayleigh distribution belong to the Pearsonian system? 

20 The distribution given by 

\[ f(x; \beta) = \frac{4}{\beta^{3}\sqrt{\pi}}\, x^{2} e^{-x^{2}/\beta^{2}} I_{(0,\infty)}(x) \quad \text{for } \beta > 0 \]

is called the Maxwell distribution. 

(a) Show that the mean and variance exist, and find them. 

(b) Does this distribution belong to the Pearsonian system? 

21 The distribution given by 

is called the r distribution . 

(a) Show that the mean and variance exist, and find them. 

(b) Does this distribution belong to the Pearsonian system? 

22 A die is cast until a 6 appears. What is the probability that it must be cast more 
than five times? 

23 Red-blood-cell deficiency may be determined by examining a specimen of the 
blood under a microscope. Suppose that a certain small fixed volume contains, 
on an average, 20 red cells for normal persons. What is the probability that a 
specimen from a normal person will contain less than 15 red cells? 

24 A telephone switchboard handles 600 calls, on an average, during a rush hour.
The board can make a maximum of 20 connections per minute. Use the Poisson 
distribution to evaluate the probability that the board will be overtaxed during any 
given minute. 

25 Suppose that a particle is equally likely to release one, two, or three other particles, 
and suppose that these second-generation particles are in turn each equally likely 
to release one, two, or three third-generation particles. What is the density of 
the number of third-generation particles? 
26 Find the mean of the Gumbel distribution.

27 Derive the mean and variance of the Weibull distribution.

*28 Show that 

=^ ki r-r+T)l **'(* 

for X a binomially distributed random variable. That is, if X is binomially
distributed with parameters n and p and Y is beta distributed with parameters k and
n − k + 1, then $F_Y(p) = 1 - F_X(k - 1)$.

*29 Suppose that X has a binomial distribution with parameters n and p and Y has a
negative binomial distribution with parameters r and p. Show that $F_X(r - 1) =
1 - F_Y(n - r)$.

*30 If U is a random variable that is uniformly distributed over the interval [0, 1], then
the random variable $Z_\lambda = [U^{\lambda} - (1 - U)^{\lambda}]/\lambda$ is said to have Tukey's symmetrical
lambda distribution. Find the first four moments of $Z_\lambda$. Find two different $\lambda$'s,
say $\lambda_1$ and $\lambda_2$, such that $Z_{\lambda_1}$ and $Z_{\lambda_2}$ have the same first four moments and unit
standard deviations.



IV 

JOINT AND CONDITIONAL DISTRIBUTIONS, 
STOCHASTIC INDEPENDENCE, 
MORE EXPECTATION 


1 INTRODUCTION AND SUMMARY 

The purpose of this chapter is to introduce the concepts of k-dimensional
distribution functions, conditional distributions, joint and conditional expecta- 
tion, and independence of random variables. It, like Chap. II, is primarily a 
“ definitions-and-their-understanding ” chapter. 

The chapter is divided into four main sections in addition to the present 
one. In Sec. 2, joint distributions, both in cumulative and density-function 
form, are introduced. The important k-dimensional discrete distribution,
called the multinomial, is included as an example. Conditional distributions and
independence of random variables are the subject of Sec. 3. Section 4 deals
with expectation with respect to k-variate distributions. Definitions of
covariance, the correlation coefficient, and joint moment generating functions, all
of which are special expectations, are given. The important concept of condi- 
tional expectation is discussed in Subsec. 4.3. Results relating independence 
and expectation are presented in Subsec. 4.5, and the famous Cauchy-Schwarz 
inequality is proved in Subsec. 4.6. The last main section, Sec. 5, is devoted 
to the important bivariate normal distribution, which gives one unified example 
of many of the terms defined in the preceding sections. 
This chapter is the multidimensional analog of Chap. II. It provides the
definitions needed to understand the distributional-theory results of Chap. V.


2 JOINT DISTRIBUTION FUNCTIONS

In the study of many random experiments, there are, or can be, more than one
random variable of interest; hence we are compelled to extend our definitions
of the distribution and density function of one random variable to those of
several random variables. Such definitions are the essence of this section,
which is the multivariate counterpart of Secs. 2 and 3 of Chap. II. As in the
univariate case we will first define, in Subsec. 2.1, the cumulative distribution
function. Although it is not as convenient to work with as density functions,
it does exist for any set of k random variables. Density functions for jointly
discrete and jointly continuous random variables will be given in Subsecs. 2.2
and 2.3, respectively.

2.1 Cumulative Distribution Function

Definition 1 Joint cumulative distribution function Let $X_1, X_2, \ldots, X_k$
be k random variables all defined on the same probability space
$(\Omega, \mathscr{A}, P[\cdot])$. The joint cumulative distribution function of $X_1, \ldots, X_k$,
denoted by $F_{X_1,\ldots,X_k}(\cdot, \ldots, \cdot)$, is defined as $P[X_1 \le x_1; \ldots; X_k \le x_k]$ for
all $(x_1, x_2, \ldots, x_k)$. ////

Thus a joint cumulative distribution function is a function with domain
euclidean k space and counterdomain the interval [0, 1]. If k = 2, the joint
cumulative distribution function is a function of two variables, and so its
domain is just the xy plane.


EXAMPLE 1 Consider the experiment of tossing two tetrahedra (regular
four-sided polyhedron) each with sides labeled 1 to 4. Let X denote the
number on the downturned face of the first tetrahedron and Y the larger
of the downturned numbers. The goal is to find $F_{X,Y}(\cdot, \cdot)$, the joint
cumulative distribution function of X and Y. Observe first that the random
variables X and Y jointly take on only the values

(1, 1), (1, 2), (1, 3), (1, 4),
(2, 2), (2, 3), (2, 4),
(3, 3), (3, 4),
(4, 4).

FIGURE 1
Sample space for experiment of tossing two tetrahedra.

The sample space for this experiment is displayed in Fig. 1. The 16
sample points are assumed to be equally likely. Our objective is to find
$F_{X,Y}(x, y)$ for each point (x, y). As an example let (x, y) = (2, 3), and
find $F_{X,Y}(2, 3) = P[X \le 2; Y \le 3]$. Now the event $\{X \le 2 \text{ and } Y \le 3\}$
corresponds to the encircled sample points in Fig. 1; hence $F_{X,Y}(2, 3) = 6/16$.
Similarly, $F_{X,Y}(x, y)$ can be found for other values of x and y.
$F_{X,Y}(x, y)$ is tabled in Fig. 2. ////
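
The value of the joint cumulative distribution function at any point can be found by direct enumeration of the 16 equally likely sample points. The following sketch is only an illustration of that counting argument; the names `SAMPLE_SPACE` and `joint_cdf` are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product

FACES = (1, 2, 3, 4)
# Each of the 16 (first, second) outcomes is assumed equally likely.
SAMPLE_SPACE = list(product(FACES, FACES))

def joint_cdf(x, y):
    """F_{X,Y}(x, y) with X = first downturned face and Y = larger of the two faces."""
    favorable = sum(1 for first, second in SAMPLE_SPACE
                    if first <= x and max(first, second) <= y)
    return Fraction(favorable, len(SAMPLE_SPACE))
```

Here `joint_cdf(2, 3)` counts the six sample points with first face at most 2 and both faces at most 3.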


We saw that the cumulative distribution function of a unidimensional 
random variable had certain properties; the same is true of a joint cumulative. 
We shall list these properties for the joint cumulative distribution function of 
two random variables; the generalization to k dimensions is straightforward. 



TABLE OF VALUES OF $F_{X,Y}(x, y)$

4 ≤ y          0        4/16        8/16        12/16       1
3 ≤ y < 4      0        3/16        6/16        9/16        9/16
2 ≤ y < 3      0        2/16        4/16        4/16        4/16
1 ≤ y < 2      0        1/16        1/16        1/16        1/16
y < 1          0        0           0           0           0

               x < 1    1 ≤ x < 2   2 ≤ x < 3   3 ≤ x < 4   4 ≤ x

FIGURE 2
Properties of bivariate cumulative distribution function $F(\cdot, \cdot)$

(i) $F(-\infty, y) = \lim_{x \to -\infty} F(x, y) = 0$ for all y, $F(x, -\infty) = \lim_{y \to -\infty} F(x, y) = 0$
for all x, and $\lim_{x \to \infty,\, y \to \infty} F(x, y) = F(\infty, \infty) = 1$.

(ii) If $x_1 < x_2$ and $y_1 < y_2$, then $P[x_1 < X \le x_2;\ y_1 < Y \le y_2]$
$= F(x_2, y_2) - F(x_2, y_1) - F(x_1, y_2) + F(x_1, y_1) \ge 0$.

(iii) F(x, y) is right continuous in each argument; that is,
$\lim_{0 < h \to 0} F(x + h, y) = \lim_{0 < h \to 0} F(x, y + h) = F(x, y)$.

We will not prove these properties. Property (ii) is a monotonicity
property of sorts; it is not equivalent to $F(x_1, y_1) \le F(x_2, y_2)$ for $x_1 \le x_2$
and $y_1 \le y_2$. Consider, for example, the bivariate function G(x, y) defined
as in Fig. 3. Note that $G(x_1, y_1) \le G(x_2, y_2)$ for $x_1 \le x_2$ and $y_1 \le y_2$,
yet $G(1 + \varepsilon, 1 + \varepsilon) - G(1 + \varepsilon, 1 - \varepsilon) - G(1 - \varepsilon, 1 + \varepsilon) + G(1 - \varepsilon, 1 - \varepsilon) = 1 -
(1 - \varepsilon) - (1 - \varepsilon) = 2\varepsilon - 1 < 0$ for $\varepsilon < \frac{1}{2}$; so G(x, y) does not satisfy property
(ii) and consequently is not a bivariate cumulative distribution function.
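
The failure of property (ii) for G can be confirmed by evaluating the four-term difference directly. This is a small sketch, assuming G takes the values 0, x, y, and 1 on the regions described for Fig. 3; the helper name `rectangle_mass` is introduced here only for illustration.

```python
def G(x, y):
    """The bivariate function G: 0 if x < 0 or y < 0 or both are below 1; 1 if both are at least 1."""
    if x < 0 or y < 0:
        return 0.0
    if x < 1 and y < 1:
        return 0.0
    if x >= 1 and y >= 1:
        return 1.0
    # Exactly one argument is at least 1; G equals the other argument.
    return x if x < 1 else y

def rectangle_mass(F, x1, x2, y1, y2):
    """F(x2, y2) - F(x2, y1) - F(x1, y2) + F(x1, y1), the quantity in property (ii)."""
    return F(x2, y2) - F(x2, y1) - F(x1, y2) + F(x1, y1)
```

With ε = 0.25, `rectangle_mass(G, 0.75, 1.25, 0.75, 1.25)` equals 2ε − 1 = −0.5, a negative "probability" for the rectangle, even though G is nondecreasing in each argument.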


Definition 2 Bivariate cumulative distribution function Any function
satisfying properties (i) to (iii) is defined to be a bivariate cumulative
distribution function without reference to any random variables. ////

Definition 3 Marginal cumulative distribution function If $F_{X,Y}(\cdot, \cdot)$ is
the joint cumulative distribution function of X and Y, then the cumulative
distribution functions $F_X(\cdot)$ and $F_Y(\cdot)$ are called marginal cumulative
distribution functions. ////


TABLE OF G(x, y)

1 ≤ y          0        x           1
0 ≤ y < 1      0        0           y
y < 0          0        0           0

               x < 0    0 ≤ x < 1   1 ≤ x

FIGURE 3
Remark $F_X(x) = F_{X,Y}(x, \infty)$, and $F_Y(y) = F_{X,Y}(\infty, y)$; that is,
knowledge of the joint cumulative distribution function of X and Y implies
knowledge of the two marginal cumulative distribution functions. ////


The converse of the above remark is not generally true; in fact, an example
(Example 8) will be given in Subsec. 2.3 below that gives an entire family of
joint cumulative distribution functions, and each member of the family has the
same marginal distributions.

We will conclude this section with a remark that gives an inequality
involving the joint cumulative distribution and marginal distributions. The
proof is left as an exercise.

Remark $F_X(x) + F_Y(y) - 1 \le F_{X,Y}(x, y) \le \sqrt{F_X(x) F_Y(y)}$ for all x, y. ////


2.2 Joint Density Functions for Discrete Random Variables 

If $X_1, X_2, \ldots, X_k$ are random variables defined on the same probability space,
then $(X_1, X_2, \ldots, X_k)$ is called a k-dimensional random variable.

Definition 4 Joint discrete random variables The k-dimensional
random variable $(X_1, X_2, \ldots, X_k)$ is defined to be a k-dimensional discrete
random variable if it can assume values only at a countable number of
points $(x_1, x_2, \ldots, x_k)$ in k-dimensional real space. We also say that
the random variables $X_1, X_2, \ldots, X_k$ are joint discrete random variables. ////

Definition 5 Joint discrete density function If $(X_1, X_2, \ldots, X_k)$ is
a k-dimensional discrete random variable, then the joint discrete density
function of $(X_1, X_2, \ldots, X_k)$, denoted by $f_{X_1,\ldots,X_k}(\cdot, \ldots, \cdot)$, is defined
to be

\[ f_{X_1,\ldots,X_k}(x_1, x_2, \ldots, x_k) = P[X_1 = x_1; X_2 = x_2; \ldots; X_k = x_k] \]

for $(x_1, x_2, \ldots, x_k)$, a value of $(X_1, X_2, \ldots, X_k)$, and is defined to be 0
otherwise. ////


Remark $\sum f_{X_1,\ldots,X_k}(x_1, \ldots, x_k) = 1$, where the summation is over all
possible values of $(X_1, \ldots, X_k)$. ////



EXAMPLE 2 Let X denote the number on the downturned face of the first
tetrahedron and Y the larger of the downturned numbers in the experiment
of tossing two tetrahedra. The values that (X, Y) can take on are (1, 1),
(1, 2), (1, 3), (1, 4), (2, 2), (2, 3), (2, 4), (3, 3), (3, 4), and (4, 4); hence X and
Y are jointly discrete. The joint discrete density function of X and Y
is given in Fig. 4.

In tabular form it is given as

(x, y)            (1,1)  (1,2)  (1,3)  (1,4)  (2,2)  (2,3)  (2,4)  (3,3)  (3,4)  (4,4)
f_{X,Y}(x, y)     1/16   1/16   1/16   1/16   2/16   1/16   1/16   3/16   1/16   4/16

or in another tabular form as

y = 4      1/16    1/16    1/16    4/16
y = 3      1/16    1/16    3/16
y = 2      1/16    2/16
y = 1      1/16

           x = 1   x = 2   x = 3   x = 4


Theorem 1 If X and Y are jointly discrete random variables, then
knowledge of $F_{X,Y}(\cdot, \cdot)$ is equivalent to knowledge of $f_{X,Y}(\cdot, \cdot)$. Also,
the statement extends to k-dimensional discrete random variables.

PROOF Let $(x_1, y_1), (x_2, y_2), \ldots$ be the possible values of (X, Y).
If $f_{X,Y}(\cdot, \cdot)$ is given, then $F_{X,Y}(x, y) = \sum f_{X,Y}(x_i, y_i)$, where the
summation is over all i for which $x_i \le x$ and $y_i \le y$. Conversely, if $F_{X,Y}(\cdot, \cdot)$ is
given, then for $(x_i, y_i)$, a possible value of (X, Y),

\begin{align*}
f_{X,Y}(x_i, y_i) = F_{X,Y}(x_i, y_i) &- \lim_{0 < h \to 0} F_{X,Y}(x_i - h, y_i) \\
  &- \lim_{0 < h \to 0} F_{X,Y}(x_i, y_i - h) \\
  &+ \lim_{0 < h \to 0} F_{X,Y}(x_i - h, y_i - h). \qquad ////
\end{align*}

Definition 6 Marginal discrete density If X and Y are jointly discrete
random variables, then $f_X(\cdot)$ and $f_Y(\cdot)$ are called marginal discrete
density functions. More generally, let $X_{i_1}, \ldots, X_{i_m}$ be any subset of the
jointly discrete random variables $X_1, \ldots, X_k$; then $f_{X_{i_1},\ldots,X_{i_m}}(x_{i_1}, \ldots, x_{i_m})$
is also called a marginal density. ////

Remark If $X_1, \ldots, X_k$ are jointly discrete random variables, then any
marginal discrete density can be found from the joint density, but not
conversely. For example, if X and Y are jointly discrete with values
$(x_1, y_1), (x_2, y_2), \ldots$, then

\[ f_X(x_k) = \sum_{\{i:\, x_i = x_k\}} f_{X,Y}(x_i, y_i) \quad \text{and} \quad f_Y(y_k) = \sum_{\{i:\, y_i = y_k\}} f_{X,Y}(x_i, y_i). \qquad //// \]

Heretofore we have indexed the values of (X, Y) with a single index,
namely i. That is, we listed values as $(x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i), \ldots$. The
values of (X, Y) could also be indexed by using separate indices for the X and Y
values. For instance, we could let i index the possible X values, say $x_1, \ldots,
x_i, \ldots$, and j index the possible Y values, say $y_1, \ldots, y_j, \ldots$. Then the values
of (X, Y) would be a subset of the points $(x_i, y_j)$ for i = 1, 2, ... and j = 1, 2, ....
If this latter method of indexing is used, then the marginal density of X is
obtained as follows:

\[ f_X(x_k) = \sum_{j} f_{X,Y}(x_k, y_j), \]

where the summation is over all $y_j$ for the fixed $x_k$. The marginal density of Y
is analogously obtained. The following example may help to clarify these two
different methods of indexing the values of (X, Y).

EXAMPLE 3 Return to the experiment of tossing two tetrahedra, and define
X as the number on the downturned face of the first tetrahedron and Y as
the larger of the numbers on the two downturned faces. The joint
density of X and Y is given in Fig. 4. The values of (X, Y) can be listed
as (1, 1), (1, 2), (1, 3), (1, 4), (2, 2), (2, 3), (2, 4), (3, 3), (3, 4), and (4, 4),
10 points in all. Or, if we note that X has values 1, 2, 3, and 4, Y has
values 1, 2, 3, and 4, and Y is greater than or equal to X, the values of
(X, Y) are $\{(i, j): i = 1, \ldots, 4,\ j = 1, \ldots, 4, \text{ and } i \le j\}$. Let us use each
of these methods of indexing to evaluate $F_{X,Y}(2, 3)$ from the joint density.
Under the first method of indexing,

\begin{align*}
F_{X,Y}(2, 3) &= \sum_{\{i:\, x_i \le 2,\ y_i \le 3\}} f_{X,Y}(x_i, y_i) \\
  &= f_{X,Y}(1, 1) + f_{X,Y}(1, 2) + f_{X,Y}(1, 3) + f_{X,Y}(2, 2) + f_{X,Y}(2, 3) = 6/16.
\end{align*}

Under the second method of indexing,

\[ F_{X,Y}(2, 3) = \sum_{i=1}^{2} \sum_{j=i}^{3} f_{X,Y}(i, j) = 6/16. \]

Similarly all other values of $F_{X,Y}(\cdot, \cdot)$ could be obtained. Also

\[ f_Y(3) = \sum_{\{i:\, y_i = 3\}} f_{X,Y}(x_i, y_i) = f_{X,Y}(1, 3) + f_{X,Y}(2, 3) + f_{X,Y}(3, 3) = 1/16 + 1/16 + 3/16 = 5/16. \]

Similarly $f_Y(1) = 1/16$, $f_Y(2) = 3/16$, and $f_Y(4) = 7/16$, which together with
$f_Y(3) = 5/16$ give the marginal discrete density function of Y. ////

EXAMPLE 4 We mentioned that marginal densities can be obtained from
the joint density, but not conversely. The following is an example of a
family of joint densities that all have the same marginals, and hence we
see that in general the joint density is not uniquely determined from
knowledge of the marginals. Consider altering the joint density given
in the previous examples as follows:

y = 4      1/16 + ε    1/16 − ε    1/16    4/16
y = 3      1/16 − ε    1/16 + ε    3/16
y = 2      1/16        2/16
y = 1      1/16

           x = 1       x = 2       x = 3   x = 4

For each $0 \le \varepsilon \le 1/16$, the above table defines a joint density. Note that
the marginal densities are independent of $\varepsilon$, and hence each of the joint
densities (there is a different joint density for each $0 \le \varepsilon \le 1/16$) has the
same marginals. ////

We saw that the binomial distribution was associated with independent, 
repeated Bernoulli trials; we shall see in the example below that the multinomial 
distribution is associated with independent, repeated trials that generalize from 
Bernoulli trials with two outcomes to more than two outcomes. 


EXAMPLE 5 Suppose that there are k + 1 (distinct) possible outcomes of a
trial. Denote these outcomes by $s_1, s_2, \ldots, s_{k+1}$, and let $p_i = P[s_i]$,
i = 1, ..., k + 1. Obviously we must have $\sum_{i=1}^{k+1} p_i = 1$, just as p + q = 1 in
the binomial case. Suppose that we repeat the trial n times. Let $X_i$
denote the number of times outcome $s_i$ occurs in the n trials,
i = 1, ..., k + 1. If the trials are repeated and independent, then the
discrete density function of the random variables $X_1, \ldots, X_k$ is

\[ f_{X_1,\ldots,X_k}(x_1, \ldots, x_k) = \frac{n!}{\prod_{i=1}^{k+1} x_i!} \prod_{i=1}^{k+1} p_i^{x_i}, \tag{1} \]

where $x_i = 0, \ldots, n$ and $\sum_{i=1}^{k+1} x_i = n$. Note that $X_{k+1} = n - \sum_{i=1}^{k} X_i$.

To justify Eq. (1), note that the left-hand side is $P[X_1 = x_1; X_2 = x_2;
\ldots; X_{k+1} = x_{k+1}]$; so, we want the probability that the n trials result in
exactly $x_1$ outcomes $s_1$, exactly $x_2$ outcomes $s_2$, ..., exactly $x_{k+1}$ outcomes
$s_{k+1}$, where $\sum_{i=1}^{k+1} x_i = n$. Any specific ordering of these n outcomes has
probability $p_1^{x_1} \cdot p_2^{x_2} \cdots p_{k+1}^{x_{k+1}}$ by the assumption of independent trials,
and there are $n!/(x_1!\, x_2! \cdots x_{k+1}!)$ such orderings. ////

Definition 7 Multinomial distribution The joint discrete density
function given in Eq. (1) is called the multinomial distribution. ////

The multinomial distribution is a (k + 1)-parameter family of
distributions, the parameters being n and $p_1, p_2, \ldots, p_k$. $p_{k+1}$ is, like q in the
binomial distribution, exactly determined by $p_{k+1} = 1 - p_1 - p_2 - \cdots - p_k$. A
particular case of a multinomial distribution is obtained by putting, for example,
n = 3, k = 2, $p_1 = .2$, and $p_2 = .3$ to get

\[ f_{X_1,X_2}(x_1, x_2) = f(x_1, x_2) = \frac{3!}{x_1!\, x_2!\, (3 - x_1 - x_2)!} (.2)^{x_1} (.3)^{x_2} (.5)^{3 - x_1 - x_2}. \]

This density is plotted in Fig. 5.

We might observe that if $X_1, X_2, \ldots, X_k$ have the multinomial
distribution given in Eq. (1), then the marginal distribution of $X_i$ is a binomial
distribution with parameters n and $p_i$. This observation can be verified by recalling
the experiment of repeated, independent trials. Each trial can be thought of
as resulting either in outcome $s_i$ or not in outcome $s_i$, in which case the trial is
Bernoulli, implying that $X_i$ has a binomial distribution with parameters n and $p_i$.
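
The claim about the binomial marginal can be checked numerically for a small case. The following sketch is illustrative and not from the text: it evaluates the multinomial density with all k + 1 counts supplied, sums out the second count in the case k = 2, and compares the result with the binomial density. The function names are assumptions for this example.

```python
from math import comb, factorial

def multinomial_pmf(counts, probs):
    """Multinomial density; counts and probs list all k + 1 outcomes and counts sum to n."""
    n = sum(counts)
    coefficient = factorial(n)
    for c in counts:
        coefficient //= factorial(c)  # each division is exact
    value = float(coefficient)
    for c, p in zip(counts, probs):
        value *= p ** c
    return value

def marginal_of_first(x1, n, probs):
    """Marginal density of X_1 in the k = 2 case, summing over x2 with x3 = n - x1 - x2."""
    return sum(multinomial_pmf((x1, x2, n - x1 - x2), probs)
               for x2 in range(n - x1 + 1))

def binomial_pmf(x, n, p):
    """Binomial density with parameters n and p."""
    return comb(n, x) * p ** x * (1.0 - p) ** (n - x)
```

With n = 3 and probabilities (.2, .3, .5), `marginal_of_first(x1, 3, (.2, .3, .5))` should match `binomial_pmf(x1, 3, .2)` for x1 = 0, 1, 2, 3 up to rounding.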

2.3 Joint Density Functions for Continuous Random Variables

Definition 8 Joint continuous random variables and density function The
k-dimensional random variable $(X_1, X_2, \ldots, X_k)$ is defined to be a
k-dimensional continuous random variable if and only if there exists a
function $f_{X_1,\ldots,X_k}(\cdot, \ldots, \cdot) \ge 0$ such that

\[ F_{X_1,\ldots,X_k}(x_1, \ldots, x_k) = \int_{-\infty}^{x_k} \cdots \int_{-\infty}^{x_1} f_{X_1,\ldots,X_k}(u_1, \ldots, u_k)\, du_1 \cdots du_k \tag{2} \]

for all $(x_1, \ldots, x_k)$. $f_{X_1,\ldots,X_k}(\cdot, \ldots, \cdot)$ is defined to be a joint probability
density function. ////
As in the unidimensional case, a joint probability density function has
two properties:

(i) $f_{X_1,\ldots,X_k}(x_1, \ldots, x_k) \ge 0$.

(ii) $\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{X_1,\ldots,X_k}(x_1, \ldots, x_k)\, dx_1 \cdots dx_k = 1$.

A unidimensional probability density function was used to find
probabilities. For example, for X a continuous random variable with probability
density $f_X(\cdot)$, $P[a < X < b] = \int_a^b f_X(x)\, dx$; that is, the area under $f_X(\cdot)$ over the
interval (a, b) gave P[a < X < b]; and, more generally, $P[X \in B] = \int_B f_X(x)\, dx$;
that is, the area under $f_X(\cdot)$ over the set B gave $P[X \in B]$. In the
two-dimensional case, volume gives probabilities. For instance, let $(X_1, X_2)$ be jointly
continuous random variables with joint probability density function
$f_{X_1,X_2}(x_1, x_2)$, and let R be some region in the $x_1 x_2$ plane; then

\[ P[(X_1, X_2) \in R] = \iint_R f_{X_1,X_2}(x_1, x_2)\, dx_1\, dx_2; \]

that is, the probability that $(X_1, X_2)$ falls in the
region R is given by the volume under $f_{X_1,X_2}(\cdot, \cdot)$ over the region R. In
particular if $R = \{(x_1, x_2): a_1 < x_1 < b_1;\ a_2 < x_2 < b_2\}$, then

\[ P[a_1 < X_1 < b_1;\ a_2 < X_2 < b_2] = \int_{a_2}^{b_2} \left[ \int_{a_1}^{b_1} f_{X_1,X_2}(x_1, x_2)\, dx_1 \right] dx_2. \]

A joint probability density function is defined as any nonnegative integrand
satisfying Eq. (2) and hence is not uniquely defined.


EXAMPLE 6 Consider the bivariate function

\[ f(x, y) = K(x + y) I_{(0,1)}(x) I_{(0,1)}(y) = K(x + y) I_U(x, y), \]

where $U = \{(x, y): 0 < x < 1 \text{ and } 0 < y < 1\}$, a unit square. Can the
constant K be selected so that f(x, y) will be a joint probability density
function? If K is positive, $f(x, y) \ge 0$.
for K = 1. So $f(x, y) = (x + y) I_{(0,1)}(x) I_{(0,1)}(y)$ is a joint probability
density function. It is sketched in Fig. 6.

Probabilities of events defined in terms of the random variables can 
be obtained by integrating the joint probability density function over the 
indicated region, for example 



= TT +'sV 

which is the volume under the surface z = x+y over the region {(x, y) 
0<*<i, 0<y<£} in the xy plane //// 
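As a numerical sanity check on the idea that volume gives probability, the following minimal sketch integrates the density $f(x, y) = x + y$ on the unit square with a midpoint rule. The helper names (`double_midpoint`, `joint_density`) and the rectangle $0 < x < 1/2$, $0 < y < 1/4$ are illustrative choices, not part of the text.

```python
def double_midpoint(f, x_lo, x_hi, y_lo, y_hi, n=400):
    """Approximate the double integral of f over a rectangle (midpoint rule)."""
    dx = (x_hi - x_lo) / n
    dy = (y_hi - y_lo) / n
    total = 0.0
    for i in range(n):
        x = x_lo + (i + 0.5) * dx
        for j in range(n):
            y = y_lo + (j + 0.5) * dy
            total += f(x, y)
    return total * dx * dy


def joint_density(x, y):
    """f(x, y) = x + y on the open unit square and 0 elsewhere (K = 1)."""
    return x + y if 0.0 < x < 1.0 and 0.0 < y < 1.0 else 0.0


# Total volume under the surface should be 1 for a density.
total_mass = double_midpoint(joint_density, 0.0, 1.0, 0.0, 1.0)
# Volume over the rectangle (0, 1/2) x (0, 1/4) is the probability of that event.
prob_rect = double_midpoint(joint_density, 0.0, 0.5, 0.0, 0.25)
print(total_mass, prob_rect)
```

Because the integrand is linear in each variable, the midpoint rule is essentially exact here; for a less regular density a finer grid or a library integrator would be the safer choice.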

Theorem 2 If $X$ and $Y$ are jointly continuous random variables, then knowledge of $F_{X,Y}(\cdot, \cdot)$ is equivalent to knowledge of an $f_{X,Y}(\cdot, \cdot)$. The remark extends to $k$-dimensional continuous random variables.

PROOF For a given $f_{X,Y}(\cdot, \cdot)$, $F_{X,Y}(x, y)$ is obtained for any $(x, y)$ by

$$F_{X,Y}(x, y) = \int_{-\infty}^{y}\int_{-\infty}^{x} f_{X,Y}(u, v)\,du\,dv.$$

For given $F_{X,Y}(\cdot, \cdot)$, an $f_{X,Y}(x, y)$ can be obtained by

$$f_{X,Y}(x, y) = \frac{\partial^2 F_{X,Y}(x, y)}{\partial x\,\partial y}$$

for $x$, $y$ points where $F_{X,Y}(x, y)$ is differentiable. ////


Definition 9 Marginal probability density functions If $X$ and $Y$ are jointly continuous random variables, then $f_X(\cdot)$ and $f_Y(\cdot)$ are called marginal probability density functions. More generally, let $X_{i_1}, \ldots, X_{i_m}$ be any subset of the jointly continuous random variables $X_1, \ldots, X_k$. $f_{X_{i_1},\ldots,X_{i_m}}(x_{i_1}, \ldots, x_{i_m})$ is called a marginal density of the $m$-dimensional random variable $(X_{i_1}, \ldots, X_{i_m})$. ////

Remark If $X_1, \ldots, X_k$ are jointly continuous random variables, then any marginal probability density function can be found. (However, knowledge of all marginal densities does not, in general, imply knowledge of the joint density, as Example 8 below shows.) If $X$ and $Y$ are jointly continuous, then

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy \quad\text{and}\quad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx \tag{3}$$

since

$$f_X(x) = \frac{dF_X(x)}{dx} = \frac{d}{dx}\left[\int_{-\infty}^{x}\left(\int_{-\infty}^{\infty} f_{X,Y}(u, y)\,dy\right) du\right] = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy.$$

////


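EXAMPLE 7 Consider the joint probability density

$$f_{X,Y}(x, y) = (x + y) I_{(0,1)}(x) I_{(0,1)}(y).$$

$$F_{X,Y}(x, y) = I_{(0,1)}(x) I_{(0,1)}(y)\int_0^y\int_0^x (u + v)\,du\,dv + I_{(0,1)}(x) I_{[1,\infty)}(y)\int_0^1\int_0^x (u + v)\,du\,dv$$
$$\qquad + I_{[1,\infty)}(x) I_{(0,1)}(y)\int_0^y\int_0^1 (u + v)\,du\,dv + I_{[1,\infty)}(x) I_{[1,\infty)}(y)$$
$$= \tfrac{1}{2}\{(x^2 y + x y^2) I_{(0,1)}(x) I_{(0,1)}(y) + (x^2 + x) I_{(0,1)}(x) I_{[1,\infty)}(y) + (y + y^2) I_{[1,\infty)}(x) I_{(0,1)}(y)\} + I_{[1,\infty)}(x) I_{[1,\infty)}(y).$$

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = I_{(0,1)}(x)\int_0^1 (x + y)\,dy = (x + \tfrac{1}{2}) I_{(0,1)}(x);$$

or,

$$f_X(x) = \frac{\partial F_{X,Y}(x, \infty)}{\partial x} = \frac{\partial F_X(x)}{\partial x} = (x + \tfrac{1}{2}) I_{(0,1)}(x).$$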


EXAMPLE 8 Let $f_X(x)$ and $f_Y(y)$ be two probability density functions with corresponding cumulative distribution functions $F_X(x)$ and $F_Y(y)$, respectively. For $-1 \le \alpha \le 1$, define

$$f_{X,Y}(x, y; \alpha) = f_X(x) f_Y(y)\{1 + \alpha[2F_X(x) - 1][2F_Y(y) - 1]\}. \tag{4}$$

We will show (i) that for each $\alpha$ satisfying $-1 \le \alpha \le 1$, $f_{X,Y}(x, y; \alpha)$ is a joint probability density function and (ii) that the marginals of $f_{X,Y}(x, y; \alpha)$ are $f_X(x)$ and $f_Y(y)$, respectively. Thus $\{f_{X,Y}(x, y; \alpha): -1 \le \alpha \le 1\}$ will be an infinite family of joint probability density functions, each having the same two given marginals. To verify (i) we must show that $f_{X,Y}(x, y; \alpha)$ is nonnegative and, if integrated over the xy plane, integrates to 1. The product

$$f_X(x) f_Y(y)\{1 + \alpha[2F_X(x) - 1][2F_Y(y) - 1]\}$$

is nonnegative provided that the factor in braces is nonnegative; but $\alpha$, $2F_X(x) - 1$, and $2F_Y(y) - 1$ are all between $-1$ and $1$, and hence so is their product, which implies $f_{X,Y}(x, y; \alpha)$ is nonnegative. Since

$$1 = \int_{-\infty}^{\infty} f_X(x)\,dx = \int_{-\infty}^{\infty}\left(\int_{-\infty}^{\infty} f_{X,Y}(x, y; \alpha)\,dy\right) dx,$$

it suffices to show that $f_X(x)$ and $f_Y(y)$ are the marginals of $f_{X,Y}(x, y; \alpha)$.

$$\int_{-\infty}^{\infty} f_{X,Y}(x, y; \alpha)\,dy = \int_{-\infty}^{\infty} f_X(x) f_Y(y)\{1 + \alpha[2F_X(x) - 1][2F_Y(y) - 1]\}\,dy$$
$$= f_X(x)\int_{-\infty}^{\infty} f_Y(y)\,dy + \alpha f_X(x)[2F_X(x) - 1]\int_{-\infty}^{\infty} [2F_Y(y) - 1] f_Y(y)\,dy$$
$$= f_X(x),$$

noting that

$$\int_{-\infty}^{\infty} [2F_Y(y) - 1] f_Y(y)\,dy = \int_0^1 (2u - 1)\,du = 0$$

by making the transformation $u = F_Y(y)$. ////
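The family in Eq. (4) can be checked numerically in a special case. The sketch below assumes, purely for illustration, that both marginals are uniform on $(0, 1)$, so that $f_X = f_Y = 1$ and $F_X(x) = x$ on that interval; the function names are hypothetical.

```python
def fgm_density(x, y, alpha):
    """Eq. (4) with uniform(0, 1) marginals (an assumption made for this sketch)."""
    if not (0.0 < x < 1.0 and 0.0 < y < 1.0):
        return 0.0
    return 1.0 + alpha * (2.0 * x - 1.0) * (2.0 * y - 1.0)


def marginal_of_x(x, alpha, n=1000):
    """Integrate the joint density over y with the midpoint rule."""
    dy = 1.0 / n
    return sum(fgm_density(x, (j + 0.5) * dy, alpha) for j in range(n)) * dy


alphas = [-1.0, -0.5, 0.0, 0.5, 1.0]
# The marginal of X at x = 0.3 should equal f_X(0.3) = 1 for every alpha.
marginals = {a: marginal_of_x(0.3, a) for a in alphas}
# The joint density should stay nonnegative on a grid of interior points.
min_density = min(
    fgm_density((i + 0.5) / 50, (j + 0.5) / 50, a)
    for a in alphas
    for i in range(50)
    for j in range(50)
)
print(marginals)
print(min_density)
```

Every value of `alpha` in the list gives a different joint density with the same marginal, which is the point of the example: the marginals alone do not determine the joint density.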


3 CONDITIONAL DISTRIBUTIONS AND 
STOCHASTIC INDEPENDENCE 

In the preceding section we defined the joint distribution and joint density 
functions of several random variables; in this section we define conditional 
distributions and the related concept of stochastic independence. Most defini- 
tions will be given first for only two random variables and later extended to k 
random variables. 

3.1 Conditional Distribution Functions 
for Discrete Random Variables 

Definition 10 Conditional discrete density function Let $X$ and $Y$ be jointly discrete random variables with joint discrete density function $f_{X,Y}(\cdot, \cdot)$. The conditional discrete density function of $Y$ given $X = x$, denoted by $f_{Y|X}(\cdot | x)$, is defined to be

$$f_{Y|X}(y | x) = \frac{f_{X,Y}(x, y)}{f_X(x)}, \tag{5}$$

if $f_X(x) > 0$, where $f_X(x)$ is the marginal density of $X$ evaluated at $x$. $f_{Y|X}(\cdot | x)$ is undefined for $f_X(x) = 0$. Similarly,

$$f_{X|Y}(x | y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}, \tag{6}$$

if $f_Y(y) > 0$. ////

Since $X$ and $Y$ are discrete, they have mass points, say $x_1, x_2, \ldots$ for $X$ and $y_1, y_2, \ldots$ for $Y$. If $f_X(x) > 0$, then $x = x_i$ for some $i$, and $f_X(x_i) = P[X = x_i]$. The numerator of the right-hand side of Eq. (5) is $f_{X,Y}(x_i, y_j) = P[X = x_i; Y = y_j]$; so

$$f_{Y|X}(y_j | x_i) = \frac{f_{X,Y}(x_i, y_j)}{f_X(x_i)} = \frac{P[X = x_i; Y = y_j]}{P[X = x_i]} = P[Y = y_j | X = x_i]$$

for $y_j$ a mass point of $Y$ and $x_i$ a mass point of $X$; hence $f_{Y|X}(\cdot | x)$ is a conditional probability as defined in Subsec. 3.6 of Chap. I. $f_{Y|X}(\cdot | x)$ is called a conditional discrete density function and hence should possess the properties of a discrete density function. To see that it does, consider $x$ as some fixed mass point of $X$. Then $f_{Y|X}(y | x)$ is a function with argument $y$; to be a discrete density function it must be nonnegative and, if summed over the possible values (mass points) of $Y$, must sum to 1. $f_{Y|X}(y | x)$ is nonnegative since $f_{X,Y}(x, y)$ is nonnegative and $f_X(x)$ is positive.

$$\sum_j f_{Y|X}(y_j | x) = \sum_j \frac{f_{X,Y}(x, y_j)}{f_X(x)} = \frac{1}{f_X(x)}\sum_j f_{X,Y}(x, y_j) = \frac{f_X(x)}{f_X(x)} = 1,$$

where the summation is over all the mass points of $Y$. (We used the fact that the marginal discrete density of $X$ is obtained by summing the joint density of $X$ and $Y$ over the possible values of $Y$.) So $f_{Y|X}(\cdot | x)$ is indeed a density; it tells us how the values of $Y$ are distributed for a given value $x$ of $X$.

The conditional cumulative distribution of Y given X ~ x can be defined 
for two jointly discrete random variables by recalling the close relationship 
between discrete density functions and cumulative distribution functions 


Definition 11 Conditional discrete cumulative distribution If $X$ and $Y$ are jointly discrete random variables, the conditional cumulative distribution of $Y$ given $X = x$, denoted by $F_{Y|X}(\cdot | x)$, is defined to be $F_{Y|X}(y | x) = P[Y \le y | X = x]$ for $f_X(x) > 0$. ////

Remark $F_{Y|X}(y | x) = \sum_{\{j:\, y_j \le y\}} f_{Y|X}(y_j | x)$. ////
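Definitions 10 and 11 can be illustrated by brute-force enumeration. The sketch below uses the experiment of tossing two fair tetrahedra, with $X$ the number on the first and $Y$ the larger of the two numbers; the 16 outcomes are assumed equally likely, and the function names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Joint discrete density of (X, Y): X = first face, Y = larger of the two faces.
joint = {}
for first, second in product(range(1, 5), repeat=2):
    key = (first, max(first, second))
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 16)


def marginal_x(x):
    """f_X(x): sum the joint density over the mass points of Y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def conditional_density(y, x):
    """f_{Y|X}(y | x) from Eq. (5); only defined when f_X(x) > 0."""
    fx = marginal_x(x)
    if fx == 0:
        raise ValueError("f_X(x) = 0, so the conditional density is undefined")
    return joint.get((x, y), Fraction(0)) / fx


def conditional_cdf(y, x):
    """F_{Y|X}(y | x): sum of the conditional density over mass points <= y."""
    return sum(conditional_density(yj, x) for yj in range(1, 5) if yj <= y)


density_given_2 = {y: conditional_density(y, 2) for y in range(1, 5)}
print(density_given_2)
print(conditional_cdf(3, 2))
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the conditional density sums to exactly 1, mirroring the argument above.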


EXAMPLE 9 Return to the experiment of tossing two tetrahedra. Let $X$ denote the number on the downturned face of the first and $Y$ the larger of the downturned numbers. What is the density of $Y$ given that $X = 2$?

where $x_i = 0, 1, 2, 3$, or $4$ and $i = 1, \ldots, 4$, subject to the restriction that $\sum x_i \le 12$. There are a large number of conditional densities associated with this density; an example is

$f_{X_2,X_4|X_1,X_3}(x_2, x_4 | x_1, x_3)$,

where $x_i = 0, 1, \ldots, 4$ and $x_2 + x_4 \le 12 - x_1 - x_3$. ////


3.2 Conditional Distribution Functions
for Continuous Random Variables


Definition 13 Conditional probability density function Let $X$ and $Y$ be jointly continuous random variables with joint probability density function $f_{X,Y}(x, y)$. The conditional probability density function of $Y$ given $X = x$, denoted by $f_{Y|X}(\cdot | x)$, is defined to be

$$f_{Y|X}(y | x) = \frac{f_{X,Y}(x, y)}{f_X(x)} \tag{7}$$

if $f_X(x) > 0$, where $f_X(x)$ is the marginal probability density of $X$, and is undefined at points where $f_X(x) = 0$. Similarly,

$$f_{X|Y}(x | y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} \quad\text{if } f_Y(y) > 0, \tag{8}$$

and is undefined if $f_Y(y) = 0$. ////


$f_{Y|X}(\cdot | x)$ is called a (conditional) probability density function and hence should possess the properties of a probability density function. $f_{Y|X}(\cdot | x)$ is clearly nonnegative, and

$$\int_{-\infty}^{\infty} f_{Y|X}(y | x)\,dy = \int_{-\infty}^{\infty} \frac{f_{X,Y}(x, y)}{f_X(x)}\,dy = \frac{1}{f_X(x)}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = \frac{f_X(x)}{f_X(x)} = 1.$$

The density $f_{Y|X}(\cdot | x)$ is a density of the random variable $Y$ given that $x$ is the value of the random variable $X$. In the conditional density $f_{Y|X}(\cdot | x)$, $x$ is fixed and could be thought of as a parameter. Consider $f_{Y|X}(\cdot | x_0)$, that is, the density of $Y$ given that $X$ was observed to be $x_0$. Now $f_{X,Y}(x, y)$ plots as a surface over the xy plane. A plane perpendicular to the xy plane which intersects the xy plane on the line $x = x_0$ will intersect the surface in the curve $f_{X,Y}(x_0, y)$. The area under this curve is

$$\int_{-\infty}^{\infty} f_{X,Y}(x_0, y)\,dy = f_X(x_0).$$

Hence, if we divide $f_{X,Y}(x_0, y)$ by $f_X(x_0)$, we obtain a density which is precisely $f_{Y|X}(y | x_0)$.

Again, the conditional cumulative distribution can be defined in the
natural way.


Definition 14 Conditional continuous cumulative distribution If $X$ and $Y$ are jointly continuous, then the conditional cumulative distribution of $Y$ given $X = x$ is defined as

$$F_{Y|X}(y | x) = \int_{-\infty}^{y} f_{Y|X}(z | x)\,dz$$

for all $x$ such that $f_X(x) > 0$. ////


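EXAMPLE 12 Suppose $f_{X,Y}(x, y) = (x + y) I_{(0,1)}(x) I_{(0,1)}(y)$.

$$f_{Y|X}(y | x) = \frac{(x + y) I_{(0,1)}(x) I_{(0,1)}(y)}{(x + \tfrac{1}{2}) I_{(0,1)}(x)} = \frac{x + y}{x + \tfrac{1}{2}}\, I_{(0,1)}(y)$$

for $0 < x < 1$. Note that

$$F_{Y|X}(y | x) = \int_{-\infty}^{y} f_{Y|X}(z | x)\,dz = \int_0^y \frac{x + z}{x + \tfrac{1}{2}}\,dz = \frac{1}{x + \tfrac{1}{2}}\int_0^y (x + z)\,dz = \frac{xy + y^2/2}{x + \tfrac{1}{2}} \quad\text{for } 0 < y < 1.$$

////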
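The conditional density and conditional cumulative distribution for the density $(x + y)$ on the unit square can be cross-checked numerically. In this sketch the closed form $(xy + y^2/2)/(x + 1/2)$ is compared with a midpoint-rule integral of $(x + z)/(x + 1/2)$; the value $x = 0.3$ and the function names are arbitrary illustrative choices.

```python
def conditional_density(y, x):
    """f_{Y|X}(y | x) = (x + y) / (x + 1/2) for 0 < y < 1, given 0 < x < 1."""
    return (x + y) / (x + 0.5) if 0.0 < y < 1.0 else 0.0


def conditional_cdf_numeric(y, x, n=2000):
    """Midpoint-rule integral of the conditional density from 0 to y."""
    dz = y / n
    return sum(conditional_density((j + 0.5) * dz, x) for j in range(n)) * dz


def conditional_cdf_closed(y, x):
    """Closed form (x*y + y**2 / 2) / (x + 1/2), valid for 0 < y < 1."""
    return (x * y + y * y / 2.0) / (x + 0.5)


x0 = 0.3
# A conditional density must integrate to 1 over the range of Y.
total = conditional_cdf_numeric(1.0, x0)
# The numeric and closed-form conditional cdf should agree at an interior point.
gap = abs(conditional_cdf_numeric(0.6, x0) - conditional_cdf_closed(0.6, x0))
print(total, gap)
```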


Conditional probability density functions can be analogously defined for $k$-dimensional continuous random variables. For instance,

$$f_{X_1,X_2,X_4|X_3,X_5}(x_1, x_2, x_4 | x_3, x_5) = \frac{f_{X_1,X_2,X_3,X_4,X_5}(x_1, x_2, x_3, x_4, x_5)}{f_{X_3,X_5}(x_3, x_5)}$$

for $f_{X_3,X_5}(x_3, x_5) > 0$.

3.3 More on Conditional Distribution Functions

We have defined the conditional cumulative distribution $F_{Y|X}(y | x)$ for either jointly continuous or jointly discrete random variables. If $X$ is discrete and $Y$ is any random variable, then $F_{Y|X}(y | x)$ can be defined as $P[Y \le y | X = x]$ if $x$ is a mass point of $X$. We would like to define $P[Y \le y | X = x]$ and, more generally, $P[A | X = x]$, where $A$ is any event, for $X$ either a discrete or continuous random variable. Thus we seek to define the conditional probability of an event $A$ given a random variable $X = x$.

We start by assuming that the event $A$ and the random variable $X$ are both defined on the same probability space. We want to define $P[A | X = x]$. If $X$ is discrete, either $x$ is a mass point of $X$, or it is not; and if $x$ is a mass point of $X$,

$$P[A | X = x] = \frac{P[A; X = x]}{P[X = x]},$$

which is well defined; on the other hand, if $x$ is not a mass point of $X$, we are not interested in $P[A | X = x]$. Now if $X$ is continuous, $P[A | X = x]$ cannot be analogously defined since $P[X = x] = 0$; however, if $x$ is such that the events $\{x - h < X < x + h\}$ have positive probability for every $h > 0$, then $P[A | X = x]$ could be defined as

$$\lim_{0 < h \to 0} P[A | x - h < X < x + h] \tag{9}$$

provided that the limit exists. We will take Eq. (9) as our definition of $P[A | X = x]$ if the indicated limit exists, and leave $P[A | X = x]$ undefined otherwise. (It is, in fact, possible to give $P[A | X = x]$ meaning even if $P[X = x] = 0$, and such is done in advanced probability theory.)

We will seldom be interested in $P[A | X = x]$ per se, but will be interested in using it to calculate certain probabilities. We note the following formulas:

(i) $P[A] = \sum_i P[A | X = x_i] f_X(x_i)$ (10)

if $X$ is discrete with mass points $x_1, x_2, \ldots$.

(ii) $P[A] = \int_{-\infty}^{\infty} P[A | X = x] f_X(x)\,dx$ (11)

if $X$ is continuous.

(iii) $P[A; X \in B] = \sum_{\{i:\, x_i \in B\}} P[A | X = x_i] f_X(x_i)$ (12)

if $X$ is discrete with mass points $x_1, x_2, \ldots$.

(iv) $P[A; X \in B] = \int_B P[A | X = x] f_X(x)\,dx$ (13)

if $X$ is continuous.

Although we will not prove the above formulas, we note that Eq. (10) is just the theorem of total probabilities given in Subsec. 3.6 of Chap. I and the others are generalizations of the same. Some problems are of such a nature that it is easy to find $P[A | X = x]$ and difficult to find $P[A]$. If, however, $f_X(\cdot)$ is known, then $P[A]$ can be easily obtained using the appropriate one of the above formulas.

Remark $F_{X,Y}(x, y) = \int_{-\infty}^{x} F_{Y|X}(y | x') f_X(x')\,dx'$ results from Eq. (13) by taking $A = \{Y \le y\}$ and $B = (-\infty, x]$; and $F_Y(y) = \int_{-\infty}^{\infty} F_{Y|X}(y | x) f_X(x)\,dx$ is obtained from Eq. (11) by taking $A = \{Y \le y\}$. ////

We add one other formula, whose proof is also omitted. Suppose $A = \{h(X, Y) \le z\}$, where $h(\cdot, \cdot)$ is some function of two variables; then

(v) $P[A | X = x] = P[h(X, Y) \le z | X = x] = P[h(x, Y) \le z | X = x]$. (14)

The following is a classical example that uses Eq. (11); another example
utilizing Eqs. (14) and (11) appears at the end of the next subsection.


EXAMPLE 13 Three points are selected randomly on the circumference of a circle. What is the probability that there will be a semicircle on which all three points will lie? By selecting a point "randomly," we mean that the point is equally likely to be any point on the circumference of the circle; that is, the point is uniformly distributed over the circumference of the circle. Let us use the first point to orient the circle; for example, orient the circle (assumed centered at the origin) so that the first point falls on the positive x axis. Let $X$ denote the position of the second point, and let $A$ denote the event that all three points lie on the same half circle. $X$ is uniformly distributed over the interval $(0, 2\pi)$. According to Eq. (11), $P[A] = \int P[A | X = x] f_X(x)\,dx$. Note that for $0 < x < \pi$, $P[A | X = x] = (\pi - x + \pi)/2\pi = (2\pi - x)/2\pi$ since, given $X = x$, event $A$ occurs if and only if the third point falls between $x - \pi$ and $\pi$. Similarly, $P[A | X = x] = (x + \pi - \pi)/2\pi = x/2\pi$ for $\pi \le x < 2\pi$. Hence

$$P[A] = \int_0^{2\pi} P[A | X = x]\,\frac{1}{2\pi}\,dx = \frac{1}{2\pi}\left\{\int_0^{\pi} \frac{2\pi - x}{2\pi}\,dx + \int_{\pi}^{2\pi} \frac{x}{2\pi}\,dx\right\} = \frac{3}{4}.$$

////
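The answer to the three-point problem can be checked by simulation. The sketch below is a plain Monte Carlo estimate, not the conditioning argument used in the example: it draws three independent uniform angles and uses the fact that the points fit on a half circle exactly when the largest gap between neighbouring points is at least $\pi$. The trial count and seed are arbitrary.

```python
import math
import random


def all_on_semicircle(angles):
    """True if the points (angles in radians) fit on some half circle.

    The points fit exactly when the largest gap between neighbouring points,
    going once around the circle, is at least pi.
    """
    ordered = sorted(a % (2.0 * math.pi) for a in angles)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    gaps.append(2.0 * math.pi - ordered[-1] + ordered[0])  # wrap-around gap
    return max(gaps) >= math.pi


def estimate_probability(trials=200_000, seed=0):
    """Fraction of simulated triples that lie on a common half circle."""
    rng = random.Random(seed)
    hits = sum(
        all_on_semicircle([rng.uniform(0.0, 2.0 * math.pi) for _ in range(3)])
        for _ in range(trials)
    )
    return hits / trials


estimate = estimate_probability()
print(estimate)  # should be close to 3/4, up to Monte Carlo error
```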
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3.4 Independence 

When we defined the conditional probability of two events in Chap. I, we also defined independence of events. We have now defined the conditional distribution of random variables, so we should define independence of random variables as well.

Definition 15 Stochastic independence Let $(X_1, X_2, \ldots, X_k)$ be a $k$-dimensional random variable. $X_1, X_2, \ldots, X_k$ are defined to be stochastically independent if and only if

$$F_{X_1,\ldots,X_k}(x_1, \ldots, x_k) = \prod_{i=1}^{k} F_{X_i}(x_i) \tag{15}$$

for all $x_1, x_2, \ldots, x_k$. ////

Definition 16 Stochastic independence Let $(X_1, X_2, \ldots, X_k)$ be a $k$-dimensional discrete random variable with joint discrete density function $f_{X_1,\ldots,X_k}(\cdot, \ldots, \cdot)$. $X_1, \ldots, X_k$ are stochastically independent if and only if

$$f_{X_1,\ldots,X_k}(x_1, \ldots, x_k) = \prod_{i=1}^{k} f_{X_i}(x_i) \tag{16}$$

for all values $(x_1, \ldots, x_k)$ of $(X_1, \ldots, X_k)$. ////

Definition 17 Stochastic independence Let $(X_1, \ldots, X_k)$ be a $k$-dimensional continuous random variable with joint probability density function $f_{X_1,\ldots,X_k}(\cdot, \ldots, \cdot)$. $X_1, \ldots, X_k$ are stochastically independent if and only if

$$f_{X_1,\ldots,X_k}(x_1, \ldots, x_k) = \prod_{i=1}^{k} f_{X_i}(x_i) \tag{17}$$

for all $x_1, \ldots, x_k$. ////

Remark Often the word "stochastically" will be omitted. ////

We saw that independence of events was closely related to conditional probability; likewise independence of random variables is closely related to conditional distributions of random variables. For example, suppose $X$ and $Y$ are two independent random variables; then $f_{X,Y}(x, y) = f_X(x) f_Y(y)$ by definition of independence; however, $f_{X,Y}(x, y) = f_{Y|X}(y | x) f_X(x)$ by definition of conditional density, which implies that $f_{Y|X}(y | x) = f_Y(y)$; that is, the conditional density of $Y$ given $x$ is the unconditional density of $Y$. So to show that two random variables are not independent, it suffices to show that $f_{Y|X}(y | x)$ depends on $x$.


EXAMPLE 14 Let $X$ be the number on the downturned face of the first tetrahedron and $Y$ the larger of the two downturned numbers in the experiment of tossing two tetrahedra. Are $X$ and $Y$ independent? Obviously not, since $f_{Y|X}(2 | 3) = P[Y = 2 | X = 3] = 0 \ne f_Y(2) = P[Y = 2] = \tfrac{3}{16}$.

////
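For jointly discrete random variables with finitely many mass points, Definition 16 can be tested directly by comparing the joint density with the product of the marginals at every pair of mass points. The sketch below does this for two joint densities built from two fair tetrahedra; the function `is_independent` is an illustrative helper, not something defined in the text.

```python
from fractions import Fraction
from itertools import product


def is_independent(joint):
    """Check Definition 16: joint density equals the product of the marginals."""
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    fx = {x: sum(joint.get((x, y), Fraction(0)) for y in ys) for x in xs}
    fy = {y: sum(joint.get((x, y), Fraction(0)) for x in xs) for y in ys}
    return all(
        joint.get((x, y), Fraction(0)) == fx[x] * fy[y]
        for x, y in product(xs, ys)
    )


dependent = {}    # (first face, larger face)
independent = {}  # (first face, second face)
for first, second in product(range(1, 5), repeat=2):
    key = (first, max(first, second))
    dependent[key] = dependent.get(key, Fraction(0)) + Fraction(1, 16)
    independent[(first, second)] = Fraction(1, 16)

# Marginal probability that the larger face equals 2.
f_y_2 = sum(p for (_, y), p in dependent.items() if y == 2)
print(is_independent(dependent), is_independent(independent), f_y_2)
```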

EXAMPLE 15 Let $f_{X,Y}(x, y) = (x + y) I_{(0,1)}(x) I_{(0,1)}(y)$. Are $X$ and $Y$ independent? No, since $f_{Y|X}(y | x) = [(x + y)/(x + \tfrac{1}{2})] I_{(0,1)}(y)$ for $0 < x < 1$; $f_{Y|X}(y | x)$ depends on $x$ and hence cannot equal $f_Y(y)$. ////


EXAMPLE 16 Let $f_{X,Y}(x, y) = e^{-(x+y)} I_{(0,\infty)}(x) I_{(0,\infty)}(y)$. $X$ and $Y$ are independent since

$$f_{X,Y}(x, y) = [e^{-x} I_{(0,\infty)}(x)][e^{-y} I_{(0,\infty)}(y)] = f_X(x) f_Y(y)$$

for all $(x, y)$. ////

It can be proved that if $X_1, \ldots, X_k$ are jointly continuous random variables, then Definitions 15 and 17 are equivalent. Similarly, for jointly discrete random variables, Definitions 15 and 16 are equivalent. It can also be proved that Eq. (15) is equivalent to $P[X_1 \in B_1; \ldots; X_k \in B_k] = \prod_{i=1}^{k} P[X_i \in B_i]$ for sets $B_1, \ldots, B_k$. The following important result is easily derived using the above equivalent notions of independence.

Theorem 3 If $X_1, \ldots, X_k$ are independent random variables and $g_1(\cdot), \ldots, g_k(\cdot)$ are $k$ functions such that $Y_j = g_j(X_j)$, $j = 1, \ldots, k$, are random variables, then $Y_1, \ldots, Y_k$ are independent.

PROOF Note that if $g_j^{-1}(B_j) = \{z: g_j(z) \in B_j\}$, then the events $\{Y_j \in B_j\}$ and $\{X_j \in g_j^{-1}(B_j)\}$ are equivalent; consequently,

$$P[Y_1 \in B_1; \ldots; Y_k \in B_k] = P[X_1 \in g_1^{-1}(B_1); \ldots; X_k \in g_k^{-1}(B_k)] = \prod_{j=1}^{k} P[X_j \in g_j^{-1}(B_j)] = \prod_{j=1}^{k} P[Y_j \in B_j].$$

////

For $k = 2$, the above theorem states that if two random variables, say $X$ and $Y$, are independent, then a function of $X$ is independent of a function of $Y$. Such a result is certainly intuitively plausible.

We will return to independence of random variables in Subsec. 4.5. Equation (14) of the previous subsection states that $P[h(X, Y) \le z | X = x] = P[h(x, Y) \le z | X = x]$. Now if $X$ and $Y$ are assumed to be independent, then $P[h(x, Y) \le z | X = x] = P[h(x, Y) \le z]$, which is a probability that may be easy to calculate for certain problems.

EXAMPLE 17 Let a random variable $Y$ represent the diameter of a shaft and a random variable $X$ represent the inside diameter of the housing that is intended to support the shaft. By design the shaft is to have diameter 99.5 units and the housing inside diameter 100 units. If the manufacturing process of each of the items is imperfect, so that in fact $Y$ is uniformly distributed over the interval $(98.5, 100.5)$ and $X$ is uniformly distributed over $(99, 101)$, what is the probability that a particular shaft can be successfully paired with a particular housing, when "successfully paired" is taken to mean that $X - h < Y < X$ for some small positive quantity $h$? Assume that $X$ and $Y$ are independent; then

$$P[X - h < Y < X] = \int_{-\infty}^{\infty} P[X - h < Y < X | X = x] f_X(x)\,dx = \int_{99}^{101} P[x - h < Y < x]\,\tfrac{1}{2}\,dx.$$

Suppose now that $h = 1$; then

$$P[x - 1 < Y < x] = \begin{cases} \tfrac{1}{2}(x - 98.5) & \text{for } 99 < x \le 99.5 \\ \tfrac{1}{2} & \text{for } 99.5 < x \le 100.5 \\ \tfrac{1}{2}(100.5 - x + 1) & \text{for } 100.5 < x \le 101. \end{cases}$$

Hence,

$$P[X - 1 < Y < X] = \int_{99}^{101} P[x - 1 < Y < x]\,\tfrac{1}{2}\,dx$$
$$= \int_{99}^{99.5} \tfrac{1}{2}(x - 98.5)\,\tfrac{1}{2}\,dx + \int_{99.5}^{100.5} \tfrac{1}{2}\cdot\tfrac{1}{2}\,dx + \int_{100.5}^{101} \tfrac{1}{2}(100.5 - x + 1)\,\tfrac{1}{2}\,dx = \tfrac{7}{16}.$$

////
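The shaft-and-housing calculation can be reproduced numerically from Eq. (11). The sketch below assumes that the shaft diameter $Y$ is uniform on $(98.5, 100.5)$, the interval centered at the design diameter 99.5, and that the housing diameter $X$ is uniform on $(99, 101)$, with $h = 1$; the function names are illustrative.

```python
def uniform_prob(lo, hi, a, b):
    """P[lo < Y < hi] for Y uniformly distributed on (a, b)."""
    overlap = min(hi, b) - max(lo, a)
    return max(overlap, 0.0) / (b - a)


def pairing_probability(h=1.0, n=4000):
    """Integrate P[x - h < Y < x] * f_X(x) over (99, 101) by the midpoint rule."""
    x_lo, x_hi = 99.0, 101.0
    dx = (x_hi - x_lo) / n
    density_x = 1.0 / (x_hi - x_lo)  # uniform density of the housing diameter
    total = 0.0
    for i in range(n):
        x = x_lo + (i + 0.5) * dx
        # Independence lets us drop the conditioning on X = x.
        total += uniform_prob(x - h, x, 98.5, 100.5) * density_x
    return total * dx


probability = pairing_probability()
print(probability)
```

The integrand is piecewise linear with breaks at 99.5 and 100.5, which fall on grid edges for the chosen `n`, so the midpoint rule reproduces the three-piece integral essentially exactly.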
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4 EXPECTATION 

When we introduced the concept of expectation for univariate random variables in Sec. 4 of Chap. II, we first defined the mean and variance as particular expectations and then defined the expectation of a general function of a random variable. Here, we will commence, in Subsec. 4.1, with the definition of the expectation of a general function of a $k$-dimensional random variable. The definition will be given for only those $k$-dimensional random variables which have densities.


4.1 Definition 

Definition 18 Expectation Let $(X_1, \ldots, X_k)$ be a $k$-dimensional random variable with density $f_{X_1,\ldots,X_k}(\cdot, \ldots, \cdot)$. The expected value of a function $g(\cdot, \ldots, \cdot)$ of the $k$-dimensional random variable, denoted by $\mathcal{E}[g(X_1, \ldots, X_k)]$, is defined to be

$$\mathcal{E}[g(X_1, \ldots, X_k)] = \sum g(x_1, \ldots, x_k) f_{X_1,\ldots,X_k}(x_1, \ldots, x_k) \tag{18}$$

if the random variable $(X_1, \ldots, X_k)$ is discrete, where the summation is over all possible values of $(X_1, \ldots, X_k)$, and

$$\mathcal{E}[g(X_1, \ldots, X_k)] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x_1, \ldots, x_k) f_{X_1,\ldots,X_k}(x_1, \ldots, x_k)\,dx_1 \cdots dx_k \tag{19}$$

if the random variable $(X_1, \ldots, X_k)$ is continuous. ////

In order for the above to be defined, it is understood that the sum and
multiple integral, respectively, exist.

Theorem 4 In particular, if $g(x_1, \ldots, x_k) = x_i$, then

$$\mathcal{E}[g(X_1, \ldots, X_k)] = \mathcal{E}[X_i] = \mu_{X_i}. \tag{20}$$

PROOF Assume that $(X_1, \ldots, X_k)$ is continuous. [The proof for $(X_1, \ldots, X_k)$ discrete is similar.]

$$\mathcal{E}[g(X_1, \ldots, X_k)] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_i f_{X_1,\ldots,X_k}(x_1, \ldots, x_k)\,dx_1 \cdots dx_k = \int_{-\infty}^{\infty} x_i f_{X_i}(x_i)\,dx_i = \mathcal{E}[X_i],$$

using the fact that the marginal density $f_{X_i}(x_i)$ is obtained from the joint density by

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{X_1,\ldots,X_k}(x_1, \ldots, x_k)\,dx_1 \cdots dx_{i-1}\,dx_{i+1} \cdots dx_k.$$

////

Similarly, the following theorem can be proved.

Theorem 5 If $g(x_1, \ldots, x_k) = (x_i - \mathcal{E}[X_i])^2$, then

$$\mathcal{E}[g(X_1, \ldots, X_k)] = \mathcal{E}[(X_i - \mathcal{E}[X_i])^2] = \text{var}[X_i].$$

////

We might note that the "expectation" in the notation $\mathcal{E}[X_i]$ of Eq. (20) has two different interpretations: one is that the expectation is taken over the joint distribution of $X_1, \ldots, X_k$, and the other is that the expectation is taken over the marginal distribution of $X_i$. What Theorem 4 really says is that these two expectations are equivalent, and hence we are justified in using the same notation for both.


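EXAMPLE 18 Consider the experiment of tossing two tetrahedra. Let $X$ be the number on the first and $Y$ the larger of the two numbers. We gave the joint discrete density function of $X$ and $Y$ in Example 2.

$$\mathcal{E}[XY] = \sum\sum xy\, f_{X,Y}(x, y)$$
$$= 1\cdot 1(\tfrac{1}{16}) + 1\cdot 2(\tfrac{1}{16}) + 1\cdot 3(\tfrac{1}{16}) + 1\cdot 4(\tfrac{1}{16}) + 2\cdot 2(\tfrac{2}{16}) + 2\cdot 3(\tfrac{1}{16}) + 2\cdot 4(\tfrac{1}{16}) + 3\cdot 3(\tfrac{3}{16}) + 3\cdot 4(\tfrac{1}{16}) + 4\cdot 4(\tfrac{4}{16}) = \tfrac{135}{16}.$$

$$\mathcal{E}[X + Y] = (1 + 1)\tfrac{1}{16} + (1 + 2)\tfrac{1}{16} + (1 + 3)\tfrac{1}{16} + (1 + 4)\tfrac{1}{16} + (2 + 2)\tfrac{2}{16} + (2 + 3)\tfrac{1}{16} + (2 + 4)\tfrac{1}{16} + (3 + 3)\tfrac{3}{16} + (3 + 4)\tfrac{1}{16} + (4 + 4)\tfrac{4}{16} = \tfrac{90}{16}.$$

$\mathcal{E}[X] = \tfrac{5}{2}$ and $\mathcal{E}[Y] = \tfrac{25}{8}$; hence $\mathcal{E}[X + Y] = \mathcal{E}[X] + \mathcal{E}[Y]$. ////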
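The sums in the discrete form of Definition 18 are easy to carry out by machine. The following sketch enumerates the 16 equally likely outcomes of two fair tetrahedra, builds the joint density of $X$ (first face) and $Y$ (larger face), and evaluates a few expectations exactly; `expect` is an illustrative helper name.

```python
from fractions import Fraction
from itertools import product

# Joint discrete density of (X, Y) obtained by enumerating all 16 outcomes.
joint = {}
for first, second in product(range(1, 5), repeat=2):
    key = (first, max(first, second))
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 16)


def expect(g):
    """E[g(X, Y)]: the sum of g(x, y) f(x, y) over all mass points, as in Eq. (18)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())


e_xy = expect(lambda x, y: x * y)
e_sum = expect(lambda x, y: x + y)
e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
print(e_xy, e_sum, e_x, e_y)
```

Note that `e_x` computed from the joint density agrees with the mean of a single fair tetrahedron, which is what Theorem 4 asserts.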

EXAMPLE 19 Suppose $f_{X,Y}(x, y) = (x + y) I_{(0,1)}(x) I_{(0,1)}(y)$.

$$\mathcal{E}[XY] = \int_0^1\int_0^1 xy(x + y)\,dx\,dy = \tfrac{1}{3}.$$
$$\mathcal{E}[X + Y] = \int_0^1\int_0^1 (x + y)(x + y)\,dx\,dy = \tfrac{7}{6}.$$
$$\mathcal{E}[X] = \mathcal{E}[Y] = \tfrac{7}{12}.$$

////

EXAMPLE 20 Let the three-dimensional random variable $(X_1, X_2, X_3)$ have the density

$$f_{X_1,X_2,X_3}(x_1, x_2, x_3) = 8 x_1 x_2 x_3 I_{(0,1)}(x_1) I_{(0,1)}(x_2) I_{(0,1)}(x_3).$$

Suppose we want to find (i) $\mathcal{E}[3X_1 + 2X_2 + 6X_3]$, (ii) $\mathcal{E}[X_1 X_2 X_3]$, and (iii) $\mathcal{E}[X_1 X_2]$. For (i) we have $g(x_1, x_2, x_3) = 3x_1 + 2x_2 + 6x_3$ and obtain

$$\mathcal{E}[g(X_1, X_2, X_3)] = \mathcal{E}[3X_1 + 2X_2 + 6X_3] = \int_0^1\int_0^1\int_0^1 (3x_1 + 2x_2 + 6x_3)\, 8 x_1 x_2 x_3\,dx_1\,dx_2\,dx_3 = \tfrac{22}{3}.$$

For (ii), we get

$$\mathcal{E}[X_1 X_2 X_3] = \int_0^1\int_0^1\int_0^1 8 x_1^2 x_2^2 x_3^2\,dx_1\,dx_2\,dx_3 = \tfrac{8}{27},$$

and for (iii) we get $\mathcal{E}[X_1 X_2] = \tfrac{4}{9}$. ////


The following remark, the proof of which is left to the reader, displays a 
property of joint expectation. It is a generalization of (ii) in Theorem 3 of 
Chap. II. 

Remark $\mathcal{E}\left[\sum_{i=1}^{m} c_i g_i(X_1, \ldots, X_k)\right] = \sum_{i=1}^{m} c_i \mathcal{E}[g_i(X_1, \ldots, X_k)]$ for constants $c_1, c_2, \ldots, c_m$. ////


4.2 Covariance and Correlation Coefficient 


Definition 19 Covariance Let $X$ and $Y$ be any two random variables defined on the same probability space. The covariance of $X$ and $Y$, denoted by $\text{cov}[X, Y]$ or $\sigma_{X,Y}$, is defined as

$$\text{cov}[X, Y] = \mathcal{E}[(X - \mu_X)(Y - \mu_Y)] \tag{21}$$

provided that the indicated expectation exists. ////


Definition 20 Correlation coefficient The correlation coefficient, denoted by $\rho[X, Y]$ or $\rho_{X,Y}$, of random variables $X$ and $Y$ is defined to be

$$\rho_{X,Y} = \frac{\text{cov}[X, Y]}{\sigma_X \sigma_Y} \tag{22}$$

provided that $\text{cov}[X, Y]$, $\sigma_X$, and $\sigma_Y$ exist, and $\sigma_X > 0$ and $\sigma_Y > 0$. ////

Both the covariance and the correlation coefficient of random variables $X$ and $Y$ are measures of a linear relationship of $X$ and $Y$ in the following sense: $\text{cov}[X, Y]$ will be positive when $X - \mu_X$ and $Y - \mu_Y$ tend to have the same sign with high probability, and $\text{cov}[X, Y]$ will be negative when $X - \mu_X$ and $Y - \mu_Y$ tend to have opposite signs with high probability. $\text{cov}[X, Y]$ tends to measure the linear relationship of $X$ and $Y$; however, its actual magnitude does not have much meaning since it depends on the variability of $X$ and $Y$. The correlation coefficient removes, in a sense, the individual variability of each of $X$ and $Y$ by dividing the covariance by the product of the standard deviations, and thus the correlation coefficient is a better measure of the linear relationship of $X$ and $Y$ than is the covariance. Also, the correlation coefficient is unitless and, as we shall see in Subsec. 4.6 below, satisfies $-1 \le \rho_{X,Y} \le 1$.

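Remark $\text{cov}[X, Y] = \mathcal{E}[(X - \mu_X)(Y - \mu_Y)] = \mathcal{E}[XY] - \mu_X \mu_Y$.

PROOF $\mathcal{E}[(X - \mu_X)(Y - \mu_Y)] = \mathcal{E}[XY - \mu_X Y - \mu_Y X + \mu_X \mu_Y]$
$= \mathcal{E}[XY] - \mu_X \mathcal{E}[Y] - \mu_Y \mathcal{E}[X] + \mu_X \mu_Y$
$= \mathcal{E}[XY] - \mu_X \mu_Y$. ////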


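EXAMPLE 21 Find $\rho_{X,Y}$ for $X$, the number on the first, and $Y$, the larger of the two numbers, in the experiment of tossing two tetrahedra. We would expect that $\rho_{X,Y}$ is positive since when $X$ is large, $Y$ tends to be large too. We calculated $\mathcal{E}[XY]$, $\mathcal{E}[X]$, and $\mathcal{E}[Y]$ in Example 18 and obtained $\mathcal{E}[XY] = \tfrac{135}{16}$, $\mathcal{E}[X] = \tfrac{5}{2}$, and $\mathcal{E}[Y] = \tfrac{25}{8}$. Thus $\text{cov}[X, Y] = \tfrac{135}{16} - \tfrac{5}{2}\cdot\tfrac{25}{8} = \tfrac{5}{8}$. Now $\mathcal{E}[X^2] = \tfrac{15}{2}$ and $\mathcal{E}[Y^2] = \tfrac{170}{16}$; hence $\text{var}[X] = \tfrac{5}{4}$ and $\text{var}[Y] = \tfrac{55}{64}$. So,

$$\rho_{X,Y} = \frac{\tfrac{5}{8}}{\sqrt{\tfrac{5}{4}\cdot\tfrac{55}{64}}} = \frac{2}{\sqrt{11}}.$$

////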
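Covariance and correlation for a finite joint discrete density can be computed directly from Eqs. (21) and (22). The sketch below does so for the two-tetrahedra experiment ($X$ the first face, $Y$ the larger face, all 16 outcomes assumed equally likely); covariance and variances are kept as exact fractions and only the final square root is taken in floating point.

```python
import math
from fractions import Fraction
from itertools import product

joint = {}
for first, second in product(range(1, 5), repeat=2):
    key = (first, max(first, second))
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 16)


def expect(g):
    """E[g(X, Y)] as a sum over the mass points of the joint density."""
    return sum(g(x, y) * p for (x, y), p in joint.items())


mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
cov = expect(lambda x, y: (x - mu_x) * (y - mu_y))  # Eq. (21)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
var_y = expect(lambda x, y: (y - mu_y) ** 2)
rho = float(cov) / math.sqrt(float(var_x * var_y))  # Eq. (22)
print(cov, var_x, var_y, rho)
```

The covariance also equals `expect(lambda x, y: x * y) - mu_x * mu_y`, the shortcut form given in the Remark above.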


EXAMPLE 22 Find $\rho_{X,Y}$ for $X$ and $Y$ if $f_{X,Y}(x, y) = (x + y) I_{(0,1)}(x) I_{(0,1)}(y)$. We saw that $\mathcal{E}[XY] = \tfrac{1}{3}$ and $\mathcal{E}[X] = \mathcal{E}[Y] = \tfrac{7}{12}$ in Example 19. Now $\mathcal{E}[X^2] = \mathcal{E}[Y^2] = \tfrac{5}{12}$; hence $\text{var}[X] = \text{var}[Y] = \tfrac{11}{144}$. Finally

$$\rho_{X,Y} = \frac{\tfrac{1}{3} - \tfrac{49}{144}}{\tfrac{11}{144}} = -\frac{1}{11}.$$

Does a negative correlation coefficient seem right? ////

4.3 Conditional Expectations

In the following chapters we shall have occasion to find the expected value 
of random variables in conditional distributions, or the expected value of one 
random variable given the value of another. 

Definition 21 Conditional expectation Let $(X, Y)$ be a two-dimensional random variable and $g(\cdot, \cdot)$ a function of two variables. The conditional expectation of $g(X, Y)$ given $X = x$, denoted by $\mathcal{E}[g(X, Y) | X = x]$, is defined to be

$$\mathcal{E}[g(X, Y) | X = x] = \int_{-\infty}^{\infty} g(x, y) f_{Y|X}(y | x)\,dy \tag{23}$$

if $(X, Y)$ are jointly continuous, and

$$\mathcal{E}[g(X, Y) | X = x] = \sum_j g(x, y_j) f_{Y|X}(y_j | x) \tag{24}$$

if $(X, Y)$ are jointly discrete, where the summation is over all possible values of $Y$. ////

In particular, if $g(x, y) = y$, we have defined $\mathcal{E}[Y | X = x] = \mathcal{E}[Y | x]$. $\mathcal{E}[Y | x]$ and $\mathcal{E}[g(X, Y) | x]$ are functions of $x$. Note that this definition can be generalized to more than two dimensions. For example, let $(X_1, \ldots, X_k, Y_1, \ldots, Y_m)$ be a $(k + m)$-dimensional continuous random variable with density $f_{X_1,\ldots,X_k,Y_1,\ldots,Y_m}(x_1, \ldots, x_k, y_1, \ldots, y_m)$; then

$$\mathcal{E}[g(X_1, \ldots, X_k, Y_1, \ldots, Y_m) | x_1, \ldots, x_k] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x_1, \ldots, x_k, y_1, \ldots, y_m)\, f_{Y_1,\ldots,Y_m|X_1,\ldots,X_k}(y_1, \ldots, y_m | x_1, \ldots, x_k)\,dy_1 \cdots dy_m.$$

////



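EXAMPLE 23 In the experiment of tossing two tetrahedra with $X$, the number on the first, and $Y$, the larger of the two numbers, we found that

$$f_{Y|X}(y | 2) = \begin{cases} \tfrac{1}{2} & \text{for } y = 2 \\ \tfrac{1}{4} & \text{for } y = 3 \\ \tfrac{1}{4} & \text{for } y = 4 \end{cases}$$

in Example 9. Hence $\mathcal{E}[Y | X = 2] = \sum y f_{Y|X}(y | x = 2) = 2\cdot\tfrac{1}{2} + 3\cdot\tfrac{1}{4} + 4\cdot\tfrac{1}{4} = \tfrac{11}{4}$. ////

EXAMPLE 24 For $f_{X,Y}(x, y) = (x + y) I_{(0,1)}(x) I_{(0,1)}(y)$, we found that

$$f_{Y|X}(y | x) = \frac{x + y}{x + \tfrac{1}{2}}\, I_{(0,1)}(y)$$

for $0 < x < 1$ in Example 12. Hence

$$\mathcal{E}[Y | X = x] = \int_0^1 y\,\frac{x + y}{x + \tfrac{1}{2}}\,dy = \frac{1}{x + \tfrac{1}{2}}\left(\frac{x}{2} + \frac{1}{3}\right)$$

for $0 < x < 1$. ////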
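A conditional mean in the continuous case is just an integral against the conditional density, Eq. (23). The sketch below evaluates $\int_0^1 y\,(x + y)/(x + 1/2)\,dy$ by the midpoint rule for a few values of $x$ and compares the result with the closed form $(x/2 + 1/3)/(x + 1/2)$; the sample points and function names are arbitrary.

```python
def conditional_mean_numeric(x, n=2000):
    """Midpoint-rule value of the integral of y * f_{Y|X}(y | x) over (0, 1)."""
    dy = 1.0 / n
    total = 0.0
    for j in range(n):
        y = (j + 0.5) * dy
        total += y * (x + y) / (x + 0.5)
    return total * dy


def conditional_mean_closed(x):
    """Closed form (x/2 + 1/3) / (x + 1/2), valid for 0 < x < 1."""
    return (x / 2.0 + 1.0 / 3.0) / (x + 0.5)


largest_gap = max(
    abs(conditional_mean_numeric(x) - conditional_mean_closed(x))
    for x in (0.1, 0.5, 0.9)
)
print(largest_gap)  # small, limited only by the midpoint-rule error
```

The conditional mean decreases slowly as $x$ grows, which is the regression curve of $Y$ on $x$ for this density.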


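As we stated above, $\mathcal{E}[g(Y) | x]$ is, in general, a function of $x$. Let us denote it by $h(x)$; that is, $h(x) = \mathcal{E}[g(Y) | x]$. Now we can evaluate the expectation of $h(X)$, a function of $X$, and will have $\mathcal{E}[h(X)] = \mathcal{E}[\mathcal{E}[g(Y) | X]]$. This gives us

$$\mathcal{E}[\mathcal{E}[g(Y) | X]] = \mathcal{E}[h(X)] = \int_{-\infty}^{\infty} h(x) f_X(x)\,dx$$
$$= \int_{-\infty}^{\infty} \mathcal{E}[g(Y) | x] f_X(x)\,dx$$
$$= \int_{-\infty}^{\infty}\left[\int_{-\infty}^{\infty} g(y) f_{Y|X}(y | x)\,dy\right] f_X(x)\,dx$$
$$= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(y) f_{Y|X}(y | x) f_X(x)\,dy\,dx$$
$$= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(y) f_{X,Y}(x, y)\,dy\,dx$$
$$= \mathcal{E}[g(Y)].$$

Thus we have proved for jointly continuous random variables $X$ and $Y$ (the proof for $X$ and $Y$ jointly discrete is similar) the following simple yet very useful theorem.


Theorem 6 Let $(X, Y)$ be a two-dimensional random variable; then

$$\mathcal{E}[g(Y)] = \mathcal{E}[\mathcal{E}[g(Y) | X]], \tag{25}$$

and in particular

$$\mathcal{E}[Y] = \mathcal{E}[\mathcal{E}[Y | X]]. \tag{26}$$

////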


Definition 22 Regression curve $\mathcal{E}[Y | X = x]$ is called the regression curve of $Y$ on $x$. It is also denoted by $\mu_{Y|X=x} = \mu_{Y|x}$. ////

Definition 23 Conditional variance The variance of $Y$ given $X = x$ is defined by $\text{var}[Y | X = x] = \mathcal{E}[Y^2 | X = x] - (\mathcal{E}[Y | X = x])^2$. ////

Theorem 7 $\text{var}[Y] = \mathcal{E}[\text{var}[Y | X]] + \text{var}[\mathcal{E}[Y | X]]$.

PROOF

$$\mathcal{E}[\text{var}[Y | X]] = \mathcal{E}[\mathcal{E}[Y^2 | X]] - \mathcal{E}[(\mathcal{E}[Y | X])^2]$$
$$= \mathcal{E}[Y^2] - (\mathcal{E}[Y])^2 - \mathcal{E}[(\mathcal{E}[Y | X])^2] + (\mathcal{E}[Y])^2$$
$$= \text{var}[Y] - \mathcal{E}[(\mathcal{E}[Y | X])^2] + (\mathcal{E}[\mathcal{E}[Y | X]])^2$$
$$= \text{var}[Y] - \text{var}[\mathcal{E}[Y | X]].$$

////

Let us note in words what the two theorems say. Equation (26) states that the mean of $Y$ is the mean or expectation of the conditional mean of $Y$, and Theorem 7 states that the variance of $Y$ is the mean or expectation of the conditional variance of $Y$, plus the variance of the conditional mean of $Y$.
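Both identities, the mean of the conditional mean and the variance decomposition, can be verified exactly on a small discrete example. The sketch below again uses two fair tetrahedra with $X$ the first face and $Y$ the larger face (16 equally likely outcomes assumed); all variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

joint = {}
for first, second in product(range(1, 5), repeat=2):
    key = (first, max(first, second))
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 16)

xs = range(1, 5)
f_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in xs}


def cond_expect(g, x):
    """E[g(Y) | X = x] as in Eq. (24)."""
    return sum(g(y) * p / f_x[x] for (xi, y), p in joint.items() if xi == x)


cond_mean = {x: cond_expect(lambda y: y, x) for x in xs}
cond_var = {x: cond_expect(lambda y: y * y, x) - cond_mean[x] ** 2 for x in xs}

mean_y = sum(y * p for (_, y), p in joint.items())
var_y = sum(y * y * p for (_, y), p in joint.items()) - mean_y ** 2

# Expectations of functions of X are taken against the marginal density of X.
mean_of_cond_mean = sum(cond_mean[x] * f_x[x] for x in xs)
var_of_cond_mean = sum(cond_mean[x] ** 2 * f_x[x] for x in xs) - mean_of_cond_mean ** 2
mean_of_cond_var = sum(cond_var[x] * f_x[x] for x in xs)

print(mean_y, mean_of_cond_mean)
print(var_y, mean_of_cond_var + var_of_cond_mean)
```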

We will conclude this subsection with one further theorem. The proof 
can be routinely obtained from Definition 21 and is left as an exercise. Also, 
the theorem can be generalized to more than two dimensions. 

Theorem 8 Let $(X, Y)$ be a two-dimensional random variable and $g_1(\cdot)$ and $g_2(\cdot)$ functions of one variable. Then

(i) $\mathcal{E}[g_1(Y) + g_2(Y) | X = x] = \mathcal{E}[g_1(Y) | X = x] + \mathcal{E}[g_2(Y) | X = x]$.

(ii) $\mathcal{E}[g_1(Y) g_2(X) | X = x] = g_2(x)\, \mathcal{E}[g_1(Y) | X = x]$. ////


4.4 Joint Moment Generating Function and Moments 

We will use our definition of the expectation of a function of several variables 
to define joint moments and the joint moment generating function. 

Definition 24 Joint moments The joint raw moments of $X_1, \ldots, X_k$ are defined by $\mathcal{E}[X_1^{r_1} X_2^{r_2} \cdots X_k^{r_k}]$, where the $r_j$'s are 0 or any positive integer; the joint moments about the means are defined by

$$\mathcal{E}[(X_1 - \mu_{X_1})^{r_1} \cdots (X_k - \mu_{X_k})^{r_k}].$$

////

Remark If $r_i = r_j = 1$ and all other $r_m$'s are 0, then that particular joint moment about the means becomes $\mathcal{E}[(X_i - \mu_{X_i})(X_j - \mu_{X_j})]$, which is just the covariance between $X_i$ and $X_j$. ////

Definition 25 Joint moment generating function The joint moment generating function of $(X_1, \ldots, X_k)$ is defined by

$$m_{X_1,\ldots,X_k}(t_1, \ldots, t_k) = \mathcal{E}\left[\exp \sum_{j=1}^{k} t_j X_j\right], \tag{27}$$

if the expectation exists for all values of $t_1, \ldots, t_k$ such that $-h < t_j < h$ for some $h > 0$, $j = 1, \ldots, k$. ////

The $r$th moment of $X_j$ may be obtained from $m_{X_1,\ldots,X_k}(t_1, \ldots, t_k)$ by differentiating it $r$ times with respect to $t_j$ and then taking the limit as all the $t$'s approach 0. Also, $\mathcal{E}[X_i^r X_j^s]$ can be obtained by differentiating the joint moment generating function $r$ times with respect to $t_i$ and $s$ times with respect to $t_j$ and then taking the limit as all the $t$'s approach 0. Similarly, other joint raw moments can be generated.

Remark $m_X(t_1) = m_{X,Y}(t_1, 0) = \lim_{t_2 \to 0} m_{X,Y}(t_1, t_2)$, and $m_Y(t_2) = m_{X,Y}(0, t_2) = \lim_{t_1 \to 0} m_{X,Y}(t_1, t_2)$; that is, the marginal moment generating functions can be obtained from the joint moment generating function. ////

An example of a joint moment generating function will appear in Sec. 5
of this chapter.

4.5 Independence and Expectation

We have already defined independence and expectation; in this section we will
relate the two concepts.

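Theorem 9 If $X$ and $Y$ are independent and $g_1(\cdot)$ and $g_2(\cdot)$ are two functions, each of a single argument, then

$$\mathcal{E}[g_1(X) g_2(Y)] = \mathcal{E}[g_1(X)] \cdot \mathcal{E}[g_2(Y)].$$

PROOF We will give the proof for jointly continuous random variables.

$$\mathcal{E}[g_1(X) g_2(Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g_1(x) g_2(y) f_{X,Y}(x, y)\,dx\,dy$$
$$= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g_1(x) g_2(y) f_X(x) f_Y(y)\,dx\,dy$$
$$= \int_{-\infty}^{\infty} g_1(x) f_X(x)\,dx \cdot \int_{-\infty}^{\infty} g_2(y) f_Y(y)\,dy$$
$$= \mathcal{E}[g_1(X)] \cdot \mathcal{E}[g_2(Y)].$$

////

Corollary If $X$ and $Y$ are independent, then $\text{cov}[X, Y] = 0$.

PROOF Take $g_1(x) = x - \mu_X$ and $g_2(y) = y - \mu_Y$; by Theorem 9,

$$\text{cov}[X, Y] = \mathcal{E}[(X - \mu_X)(Y - \mu_Y)] = \mathcal{E}[g_1(X) g_2(Y)] = \mathcal{E}[g_1(X)]\,\mathcal{E}[g_2(Y)] = \mathcal{E}[X - \mu_X] \cdot \mathcal{E}[Y - \mu_Y] = 0$$

since $\mathcal{E}[X - \mu_X] = 0$. ////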

Definition 26 Uncorrelated random variables Random variables $X$ and $Y$ are defined to be uncorrelated if and only if $\text{cov}[X, Y] = 0$. ////


Remark The converse of the above corollary is not always true; that is, 
cov [X, Y] = 0 does not always imply that X and Y are independent, as 
the following example shows. //// 


EXAMPLE 25 Let $U$ be a random variable which is uniformly distributed over the interval $(0, 1)$. Define $X = \sin 2\pi U$ and $Y = \cos 2\pi U$. $X$ and $Y$ are clearly not independent since if a value of $X$ is known, then $U$ is one of two values, and so $Y$ is also one of two values; hence the conditional distribution of $Y$ is not the same as the marginal distribution. $\mathcal{E}[Y] = \int_0^1 \cos 2\pi u\,du = 0$, and $\mathcal{E}[X] = \int_0^1 \sin 2\pi u\,du = 0$; so $\text{cov}[X, Y] = \mathcal{E}[XY] = \int_0^1 \sin 2\pi u \cos 2\pi u\,du = \tfrac{1}{2}\int_0^1 \sin 4\pi u\,du = 0$. ////
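The distinction between uncorrelated and independent can be seen numerically for $X = \sin 2\pi U$, $Y = \cos 2\pi U$ with $U$ uniform on $(0, 1)$. The sketch below approximates expectations by a midpoint rule over $u$. It then uses Theorem 9 as a test: if $X$ and $Y$ were independent, $\mathcal{E}[X^2 Y^2]$ would have to equal $\mathcal{E}[X^2]\,\mathcal{E}[Y^2]$. The choice of $g_1(x) = x^2$, $g_2(y) = y^2$ is an illustrative one, not taken from the text.

```python
import math


def mean_over_u(g, n=10_000):
    """Midpoint-rule approximation of E[g(U)] for U uniform on (0, 1)."""
    return sum(g((i + 0.5) / n) for i in range(n)) / n


def x_of(u):
    return math.sin(2.0 * math.pi * u)


def y_of(u):
    return math.cos(2.0 * math.pi * u)


e_x = mean_over_u(x_of)
e_y = mean_over_u(y_of)
cov = mean_over_u(lambda u: x_of(u) * y_of(u)) - e_x * e_y

# Theorem 9 would force these two quantities to agree under independence.
e_x2y2 = mean_over_u(lambda u: x_of(u) ** 2 * y_of(u) ** 2)
product_of_means = mean_over_u(lambda u: x_of(u) ** 2) * mean_over_u(lambda u: y_of(u) ** 2)
print(cov, e_x2y2, product_of_means)
```

The covariance vanishes, yet the last two numbers differ (about 1/8 versus 1/4), so the variables are uncorrelated without being independent.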

Theorem 10 Two jointly distributed random variables $X$ and $Y$ are independent if and only if $m_{X,Y}(t_1, t_2) = m_X(t_1) m_Y(t_2)$ for all $t_1, t_2$ for which $-h < t_i < h$, $i = 1, 2$, for some $h > 0$.

PROOF [Recall that $m_X(t_1)$ is the moment generating function of $X$. Also note that $m_X(t_1) = m_{X,Y}(t_1, 0)$.] $X$ and $Y$ independent imply that the joint moment generating function factors into the product of the marginal moment generating functions by Theorem 9 by taking $g_1(x) = e^{t_1 x}$ and $g_2(y) = e^{t_2 y}$. The proof in the other direction will be omitted. ////


Remark Both Theorems 9 and 10 can be generalized from two random
variables to k random variables. ////

4.6 Cauchy-Schwarz Inequality

Theorem 11 Cauchy-Schwarz inequality Let X and Y have finite
second moments; then $(\mathcal{E}[XY])^2 = |\mathcal{E}[XY]|^2 \le \mathcal{E}[X^2]\,\mathcal{E}[Y^2]$, with equality
if and only if $P[Y = cX] = 1$ for some constant c.

PROOF The existence of expectations $\mathcal{E}[X]$, $\mathcal{E}[Y]$, and $\mathcal{E}[XY]$
follows from the existence of expectations $\mathcal{E}[X^2]$ and $\mathcal{E}[Y^2]$. Define
$0 \le h(t) = \mathcal{E}[(tX - Y)^2] = \mathcal{E}[X^2]t^2 - 2\mathcal{E}[XY]t + \mathcal{E}[Y^2]$. Now h(t) is a
quadratic function in t which is greater than or equal to 0. If $h(t) > 0$,
then the roots of h(t) are not real; so $4(\mathcal{E}[XY])^2 - 4\mathcal{E}[X^2]\mathcal{E}[Y^2] < 0$,
or $(\mathcal{E}[XY])^2 < \mathcal{E}[X^2]\mathcal{E}[Y^2]$. If $h(t) = 0$ for some t, say $t_0$, then
$\mathcal{E}[(t_0 X - Y)^2] = 0$, which implies $P[t_0 X = Y] = 1$. ////

Corollary $|\rho_{X,Y}| \le 1$, with equality if and only if one random variable
is a linear function of the other with probability 1.

PROOF Rewrite the Cauchy-Schwarz inequality as
$|\mathcal{E}[UV]| \le \sqrt{\mathcal{E}[U^2]\,\mathcal{E}[V^2]}$, and set $U = X - \mu_X$ and $V = Y - \mu_Y$. ////


5 BIVARIATE NORMAL DISTRIBUTION

One of the important multivariate densities is the multivariate normal
density, which is a generalization of the normal distribution for a unidimensional
random variable. In this section we shall discuss a special case, the case of the
bivariate normal. In our discussion we will include the joint density, marginal
densities, conditional densities, conditional means and variances, covariance,
and the moment generating function. This section, then, will give an example
of many of the concepts defined in the preceding sections of this chapter.


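5.1 Density Function


Definition 27 Bivariate normal distribution Let the two-dimensional
random variable (X, Y) have the joint probability density function

\[
f_{X,Y}(x, y) = f(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}}
\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_X}{\sigma_X}\right)^2
- 2\rho\,\frac{x-\mu_X}{\sigma_X}\,\frac{y-\mu_Y}{\sigma_Y}
+ \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2\right]\right\} \tag{28}
\]

for $-\infty < x < \infty$, $-\infty < y < \infty$, where $\sigma_Y$, $\sigma_X$, $\mu_X$, $\mu_Y$, and $\rho$ are con-
stants such that $-1 < \rho < 1$, $0 < \sigma_Y$, $0 < \sigma_X$, $-\infty < \mu_X < \infty$, and
$-\infty < \mu_Y < \infty$. Then the random variable (X, Y) is defined to have a
bivariate normal distribution. ////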

The density in Eq. (28) may be represented by a bell-shaped surface
z = f(x, y) as in Fig. 7. Any plane parallel to the xy plane which cuts the
surface will intersect it in an elliptic curve, while any plane perpendicular to the
xy plane will cut the surface in a curve of the normal form. The probability
that a point (X, Y) will lie in any region R of the xy plane is obtained by
integrating the density over that region:

\[ P[(X, Y) \text{ is in } R] = \iint_R f(x, y)\,dy\,dx. \tag{29} \]

The density might, for example, represent the distribution of hits on a vertical
target, where x and y represent the horizontal and vertical deviations from the
central lines. And in fact the distribution closely approximates the distribution
of this as well as many other bivariate populations encountered in practice.

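We must first show that the function actually represents a density by
showing that its integral over the whole plane is 1; that is,

\[ \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dy\,dx = 1. \tag{30} \]

The density is, of course, positive. To simplify the integral, we shall substitute

\[ u = \frac{x - \mu_X}{\sigma_X} \quad \text{and} \quad v = \frac{y - \mu_Y}{\sigma_Y} \tag{31} \]

so that it becomes

\[
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{1}{2\pi\sqrt{1-\rho^2}}
\exp\left[-\frac{1}{2(1-\rho^2)}\,(u^2 - 2\rho uv + v^2)\right] dv\,du.
\]

On completing the square on u in the exponent, we have

\[
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{1}{2\pi\sqrt{1-\rho^2}}
\exp\left\{-\frac{1}{2(1-\rho^2)}\left[(u - \rho v)^2 + (1-\rho^2)v^2\right]\right\} dv\,du;
\]

and if we substitute

\[ w = \frac{u - \rho v}{\sqrt{1-\rho^2}} \quad \text{and} \quad dw = \frac{du}{\sqrt{1-\rho^2}}, \]

the integral may be written as the product of two simple integrals

\[
\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-w^2/2}\,dw
\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-v^2/2}\,dv, \tag{32}
\]

both of which are 1, as we have seen in studying the univariate normal distri-
bution. Equation (30) is thus verified.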
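
As a numerical companion to Eq. (30), the sketch below integrates the density of Eq. (28) over a large rectangle with a midpoint rule. It is a hedged illustration only: the function names, the parameter values, the truncation at eight standard deviations, and the grid size are all assumptions chosen for this sketch.

```python
import math

def bvn_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Bivariate normal density of Eq. (28)."""
    u = (x - mu_x) / sigma_x
    v = (y - mu_y) / sigma_y
    q = (u * u - 2.0 * rho * u * v + v * v) / (1.0 - rho * rho)
    norm = 2.0 * math.pi * sigma_x * sigma_y * math.sqrt(1.0 - rho * rho)
    return math.exp(-0.5 * q) / norm

def integrate_plane(params, half_width=8.0, n=320):
    """Midpoint rule over mean +/- half_width standard deviations in x and y."""
    mu_x, mu_y, sigma_x, sigma_y, _rho = params
    hx = 2.0 * half_width * sigma_x / n
    hy = 2.0 * half_width * sigma_y / n
    total = 0.0
    for i in range(n):
        x = mu_x - half_width * sigma_x + (i + 0.5) * hx
        for j in range(n):
            y = mu_y - half_width * sigma_y + (j + 0.5) * hy
            total += bvn_pdf(x, y, *params)
    return total * hx * hy

# Illustrative parameter values (mu_X, mu_Y, sigma_X, sigma_Y, rho).
total_mass = integrate_plane((1.0, -2.0, 2.0, 0.5, 0.6))
print("integral of f over the (truncated) plane ~", total_mass)
```

The result should agree with 1 to many decimal places; the analytic argument above is what actually proves Eq. (30), and the code only makes the claim concrete for one choice of parameters.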


Remark The cumulative bivariate normal distribution
$F(x, y) = P[X \le x;\ Y \le y]$
may be reduced to a form involving only the parameter $\rho$ by making the
substitution in Eq. (31). ////


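5.2 Moment Generating Function and Moments

To obtain the moments of X and Y, we shall find their joint moment generating
function, which is given by

\[
m_{X,Y}(t_1, t_2) = m(t_1, t_2) = \mathcal{E}[e^{t_1 X + t_2 Y}]
= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{t_1 x + t_2 y} f(x, y)\,dy\,dx.
\]


Theorem 12 The moment generating function of the bivariate normal
distribution is

\[
m(t_1, t_2) = \exp\left[t_1\mu_X + t_2\mu_Y + \tfrac{1}{2}\left(t_1^2\sigma_X^2 + 2\rho t_1 t_2 \sigma_X\sigma_Y + t_2^2\sigma_Y^2\right)\right]. \tag{33}
\]

PROOF Let us again substitute for x and y in terms of u and v to
obtain

\[
m(t_1, t_2) = e^{t_1\mu_X + t_2\mu_Y} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}
e^{t_1\sigma_X u + t_2\sigma_Y v}\, \frac{1}{2\pi\sqrt{1-\rho^2}}\,
e^{-[1/2(1-\rho^2)](u^2 - 2\rho uv + v^2)}\,dv\,du. \tag{34}
\]

The combined exponents in the integrand may be written

\[
-\frac{1}{2(1-\rho^2)}\left[u^2 - 2\rho uv + v^2 - 2(1-\rho^2)t_1\sigma_X u - 2(1-\rho^2)t_2\sigma_Y v\right],
\]

and on completing the square first on u and then on v, we find this expres-
sion becomes

\[
-\frac{1}{2(1-\rho^2)}\left\{\left[u - \rho v - (1-\rho^2)t_1\sigma_X\right]^2
+ (1-\rho^2)(v - \rho t_1\sigma_X - t_2\sigma_Y)^2
- (1-\rho^2)\left(t_1^2\sigma_X^2 + 2\rho t_1 t_2\sigma_X\sigma_Y + t_2^2\sigma_Y^2\right)\right\},
\]

which, if we substitute

\[
w = \frac{u - \rho v - (1-\rho^2)t_1\sigma_X}{\sqrt{1-\rho^2}} \quad \text{and} \quad z = v - \rho t_1\sigma_X - t_2\sigma_Y,
\]

becomes

\[
-\tfrac{1}{2}w^2 - \tfrac{1}{2}z^2 + \tfrac{1}{2}\left(t_1^2\sigma_X^2 + 2\rho t_1 t_2\sigma_X\sigma_Y + t_2^2\sigma_Y^2\right),
\]

and the integral in Eq. (34) may be written

\[
\begin{aligned}
m(t_1, t_2) &= e^{t_1\mu_X + t_2\mu_Y} \exp\left[\tfrac{1}{2}\left(t_1^2\sigma_X^2 + 2\rho t_1 t_2\sigma_X\sigma_Y + t_2^2\sigma_Y^2\right)\right]
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{1}{2\pi}\, e^{-w^2/2 - z^2/2}\,dw\,dz \\
&= \exp\left[t_1\mu_X + t_2\mu_Y + \tfrac{1}{2}\left(t_1^2\sigma_X^2 + 2\rho t_1 t_2\sigma_X\sigma_Y + t_2^2\sigma_Y^2\right)\right]
\end{aligned}
\]

since the double integral is equal to unity. ////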


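Theorem 13 If (X, Y) has bivariate normal distribution, then

\[
\begin{aligned}
\mathcal{E}[X] &= \mu_X, \\
\mathcal{E}[Y] &= \mu_Y, \\
\operatorname{var}[X] &= \sigma_X^2, \\
\operatorname{var}[Y] &= \sigma_Y^2, \\
\operatorname{cov}[X, Y] &= \rho\sigma_X\sigma_Y,
\end{aligned}
\]

and

\[ \rho_{X,Y} = \rho. \]

PROOF The moments may be obtained by evaluating the appro-
priate derivative of $m(t_1, t_2)$ at $t_1 = 0$, $t_2 = 0$. Thus,

\[
\mathcal{E}[X] = \left.\frac{\partial m}{\partial t_1}\right|_{t_1, t_2 = 0} = \mu_X,
\qquad
\mathcal{E}[X^2] = \left.\frac{\partial^2 m}{\partial t_1^2}\right|_{t_1, t_2 = 0} = \mu_X^2 + \sigma_X^2.
\]

Hence the variance of X is

\[ \mathcal{E}[X^2] - (\mathcal{E}[X])^2 = \sigma_X^2. \]

Similarly, on differentiating with respect to $t_2$, one finds the mean and
variance of Y to be $\mu_Y$ and $\sigma_Y^2$. We can also obtain joint moments

\[ \mathcal{E}[X^r Y^s] \]

by differentiating $m(t_1, t_2)$ r times with respect to $t_1$ and s times with respect
to $t_2$ and then putting $t_1$ and $t_2$ equal to 0. The covariance of X and Y is

\[
\begin{aligned}
\mathcal{E}[(X - \mu_X)(Y - \mu_Y)] &= \mathcal{E}[XY - X\mu_Y - Y\mu_X + \mu_X\mu_Y] \\
&= \mathcal{E}[XY] - \mu_X\mu_Y \\
&= \left.\frac{\partial^2}{\partial t_1\,\partial t_2}\, m(t_1, t_2)\right|_{t_1, t_2 = 0} - \mu_X\mu_Y \\
&= \rho\sigma_X\sigma_Y.
\end{aligned}
\]

Hence, the parameter $\rho$ is the correlation coefficient of X and Y. ////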
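
The differentiation in the proof of Theorem 13 can be imitated numerically by replacing each partial derivative of Eq. (33) with a central finite difference. This is a rough sketch under stated assumptions: the parameter values and the step size `H` are arbitrary choices for illustration, and finite differences only approximate the derivatives.

```python
import math

# Illustrative parameters (assumed values, not taken from the text).
MU_X, MU_Y, SIGMA_X, SIGMA_Y, RHO = 1.0, -2.0, 2.0, 3.0, 0.5

def mgf(t1, t2):
    """Joint moment generating function of Eq. (33)."""
    quad = ((t1 * SIGMA_X) ** 2
            + 2.0 * RHO * t1 * t2 * SIGMA_X * SIGMA_Y
            + (t2 * SIGMA_Y) ** 2)
    return math.exp(t1 * MU_X + t2 * MU_Y + 0.5 * quad)

H = 1e-3  # finite-difference step

# First partial derivatives at (0, 0): the means.
mean_x = (mgf(H, 0.0) - mgf(-H, 0.0)) / (2.0 * H)
mean_y = (mgf(0.0, H) - mgf(0.0, -H)) / (2.0 * H)

# Second partial derivatives at (0, 0): E[X^2], E[Y^2], and E[XY].
second_x = (mgf(H, 0.0) - 2.0 * mgf(0.0, 0.0) + mgf(-H, 0.0)) / H ** 2
second_y = (mgf(0.0, H) - 2.0 * mgf(0.0, 0.0) + mgf(0.0, -H)) / H ** 2
mixed = (mgf(H, H) - mgf(H, -H) - mgf(-H, H) + mgf(-H, -H)) / (4.0 * H ** 2)

var_x = second_x - mean_x ** 2
var_y = second_y - mean_y ** 2
cov_xy = mixed - mean_x * mean_y
corr_xy = cov_xy / math.sqrt(var_x * var_y)

print(mean_x, mean_y, var_x, var_y, cov_xy, corr_xy)
```

Up to finite-difference error, the printed values reproduce $\mu_X$, $\mu_Y$, $\sigma_X^2$, $\sigma_Y^2$, $\rho\sigma_X\sigma_Y$, and $\rho$ for the chosen parameters.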

Theorem 14 If (X, Y) has a bivariate normal distribution, then X and Y
are independent if and only if X and Y are uncorrelated.

PROOF X and Y are uncorrelated if and only if cov [X, Y] = 0 or,
equivalently, if and only if $\rho_{X,Y} = \rho = 0$. It can be observed that if
$\rho = 0$, the joint density f(x, y) becomes the product of two univariate
normal distributions; so that $\rho = 0$ implies X and Y are independent.
We know that, in general, independence of X and Y implies that X and Y
are uncorrelated. ////

5.3 Marginal and Conditional Densities


Theorem 15 If (X, Y) has a bivariate normal distribution, then the mar-
ginal distributions of X and Y are univariate normal distributions; that is,
X is normally distributed with mean $\mu_X$ and variance $\sigma_X^2$, and Y is nor-
mally distributed with mean $\mu_Y$ and variance $\sigma_Y^2$.

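PROOF The marginal density of one of the variables, X for example,
is by definition

\[ f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy; \]

and again substituting

\[ v = \frac{y - \mu_Y}{\sigma_Y} \]

and completing the square on v, one finds that

\[
f_X(x) = \int_{-\infty}^{\infty} \frac{1}{2\pi\sigma_X\sqrt{1-\rho^2}}
\exp\left[-\frac{1}{2}\left(\frac{x-\mu_X}{\sigma_X}\right)^2
- \frac{1}{2(1-\rho^2)}\left(v - \rho\,\frac{x-\mu_X}{\sigma_X}\right)^2\right] dv.
\]

Then the substitutions

\[
w = \frac{v - \rho(x - \mu_X)/\sigma_X}{\sqrt{1-\rho^2}} \quad \text{and} \quad dw = \frac{dv}{\sqrt{1-\rho^2}}
\]

show at once that

\[
f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma_X} \exp\left[-\frac{1}{2}\left(\frac{x-\mu_X}{\sigma_X}\right)^2\right],
\]

the univariate normal density. Similarly the marginal density of Y may
be found to be

\[
f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma_Y} \exp\left[-\frac{1}{2}\left(\frac{y-\mu_Y}{\sigma_Y}\right)^2\right].
\]
////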

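Theorem 16 If (X, Y) has a bivariate normal distribution, then the
conditional distribution of X given Y = y is normal with mean
$\mu_X + (\rho\sigma_X/\sigma_Y)(y - \mu_Y)$ and variance $\sigma_X^2(1 - \rho^2)$. Also, the conditional
distribution of Y given X = x is normal with mean $\mu_Y + (\rho\sigma_Y/\sigma_X)(x - \mu_X)$
and variance $\sigma_Y^2(1 - \rho^2)$.

PROOF The conditional distributions are obtained from the joint
and marginal distributions. Thus, the conditional density of X for fixed
values of Y is

\[ f_{X|Y}(x|y) = \frac{f(x, y)}{f_Y(y)}, \]

and, after substituting, the expression may be put in the form

\[
f_{X|Y}(x|y) = \frac{1}{\sqrt{2\pi}\,\sigma_X\sqrt{1-\rho^2}}
\exp\left\{-\frac{1}{2\sigma_X^2(1-\rho^2)}\left[x - \mu_X - \frac{\rho\sigma_X}{\sigma_Y}(y - \mu_Y)\right]^2\right\}, \tag{35}
\]

which is a univariate normal density with mean $\mu_X + (\rho\sigma_X/\sigma_Y)(y - \mu_Y)$
and with variance $\sigma_X^2(1 - \rho^2)$. The conditional distribution of Y may be
obtained by interchanging x and y throughout Eq. (35) to get

\[
f_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi}\,\sigma_Y\sqrt{1-\rho^2}}
\exp\left\{-\frac{1}{2\sigma_Y^2(1-\rho^2)}\left[y - \mu_Y - \frac{\rho\sigma_Y}{\sigma_X}(x - \mu_X)\right]^2\right\}. \tag{36}
\]
////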
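
The algebra behind Eq. (35) can be spot-checked by comparing the ratio $f(x, y)/f_Y(y)$ with the stated normal density at a few points. The following is a minimal sketch; the function names, parameter values, and test points are assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, variance):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)

def bvn_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Bivariate normal density of Eq. (28)."""
    u = (x - mu_x) / sigma_x
    v = (y - mu_y) / sigma_y
    q = (u * u - 2.0 * rho * u * v + v * v) / (1.0 - rho * rho)
    return math.exp(-0.5 * q) / (2.0 * math.pi * sigma_x * sigma_y * math.sqrt(1.0 - rho * rho))

def conditional_pdf_x_given_y(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Eq. (35): normal with the regression mean and variance sigma_x^2 (1 - rho^2)."""
    cond_mean = mu_x + rho * sigma_x / sigma_y * (y - mu_y)
    cond_var = sigma_x ** 2 * (1.0 - rho ** 2)
    return normal_pdf(x, cond_mean, cond_var)

params = (1.0, -2.0, 2.0, 0.5, -0.7)   # (mu_X, mu_Y, sigma_X, sigma_Y, rho), illustrative
mu_y, sigma_y = params[1], params[3]

worst_relative_gap = 0.0
for x in (-3.0, 0.0, 1.0, 4.5):
    for y in (-3.0, -2.0, -1.2):
        ratio = bvn_pdf(x, y, *params) / normal_pdf(y, mu_y, sigma_y ** 2)
        direct = conditional_pdf_x_given_y(x, y, *params)
        worst_relative_gap = max(worst_relative_gap, abs(ratio / direct - 1.0))

print("largest relative discrepancy:", worst_relative_gap)
```

The discrepancy should be at the level of floating-point rounding, consistent with Eq. (35) being an exact rearrangement of joint over marginal.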


As we already noted, the mean value of a random variable in a conditional
distribution is called a regression curve when regarded as a function of the fixed
variable in the conditional distribution. Thus the regression for X on Y = y in
Eq. (35) is $\mu_X + (\rho\sigma_X/\sigma_Y)(y - \mu_Y)$, which is a linear function of y in the present
case. For bivariate distributions in general, the mean of X in the conditional
density of X given Y = y will be some function of y, say $g(\cdot)$, and the equation

\[ x = g(y) \]

when plotted in the xy plane gives the regression curve for X. It is simply a
curve which gives the location of the mean of X for various values of Y in the
conditional density of X given Y = y.

For the bivariate normal distribution, the regression curve is the straight
line obtained by plotting

\[ x = \mu_X + \frac{\rho\sigma_X}{\sigma_Y}(y - \mu_Y), \]

as shown in Fig. 8. The conditional density of X given Y = y, $f_{X|Y}(x|y)$, is
also plotted in Fig. 8 for two particular values $y_0$ and $y_1$ of Y.

PROBLEMS

1 Prove or disprove:

(a) If $P[X > Y] = 1$, then $\mathcal{E}[X] > \mathcal{E}[Y]$.

(b) If $\mathcal{E}[X] > \mathcal{E}[Y]$, then $P[X > Y] = 1$.

(c) If $\mathcal{E}[X] > \mathcal{E}[Y]$, then $P[X > Y] > 0$.

2 Prove or disprove:

(a) If $F_X(z) > F_Y(z)$ for all z, then $\mathcal{E}[Y] > \mathcal{E}[X]$.
(b) If $\mathcal{E}[Y] > \mathcal{E}[X]$, then $F_X(z) > F_Y(z)$ for all z.

(c) If $\mathcal{E}[Y] > \mathcal{E}[X]$, then $F_X(z) > F_Y(z)$ for some z.
(d) If $F_X(z) = F_Y(z)$ for all z, then $P[X = Y] = 1$.

(e) If $F_X(z) > F_Y(z)$ for all z, then $P[X < Y] > 0$.
(f) If $Y = X + 1$, then $F_X(z) = F_Y(z + 1)$ for all z.

3 If $X_1$ and $X_2$ are independent random variables with distribution given by
$P[X_i = -1] = P[X_i = 1] = \tfrac{1}{2}$ for i = 1, 2, then are $X_1$ and $X_1X_2$ independent?

4 A penny and dime are tossed. Let X denote the number of heads up. Then
the penny is tossed again. Let Y denote the number of heads up on the dime
(from the first toss) and the penny from the second toss.

(a) Find the conditional distribution of Y given X = 1.
(b) Find the covariance of X and Y.

5 If X and Y have joint distribution given by

$f_{X,Y}(x, y) = 2 I_{(0,y)}(x) I_{(0,1)}(y)$.

(a) Find cov [X, Y].

(b) Find the conditional distribution of Y given X = x.

6 Consider a sample of size 2 drawn without replacement from an urn containing 
three balls, numbered 1, 2, and 3. Let X be the number on the first ball drawn 
and Y the larger of the two numbers drawn. 

(a) Find the joint discrete density function of X and Y.
(b) Find $P[X = 1 \mid Y = 3]$.

(c) Find cov [X, Y].





7 Consider two random variables X and Y having a joint probability density 
function 

A = .iOOAo dW 

(a) Find the marginal distributions of X and Y.

(b) Are X and Y independent?

8 If X has a Bernoulli distribution with parameter p (that is, $P[X = 1] = p = 1 - P[X = 0]$),
$\mathcal{E}[Y \mid X = 0] = 1$, and $\mathcal{E}[Y \mid X = 1] = 2$, what is $\mathcal{E}[Y]$?

9 Consider a sample of size 2 drawn without replacement from an urn containing
three balls, numbered 1, 2, and 3. Let X be the smaller of the two numbers
drawn and Y the larger.

(a) Find the joint discrete density function of X and Y.

(b) Find the conditional distribution of Y given X = 1.

(c) Find cov [X, Y].

10 Let X and Y be independent random variables, each having the same geometric
distribution. Find $P[X = Y]$.

11 If $F(\cdot)$ is a cumulative distribution function:

(a) Is $F(x, y) = F(x) + F(y)$ a joint cumulative distribution function?

(b) Is $F(x, y) = F(x)F(y)$ a joint cumulative distribution function?

(c) Is $F(x, y) = \max[F(x), F(y)]$ a joint cumulative distribution function?

(d) Is $F(x, y) = \min[F(x), F(y)]$ a joint cumulative distribution function?

12 Prove

$F_X(x) + F_Y(y) - 1 \le F_{X,Y}(x, y) \le \sqrt{F_X(x)F_Y(y)}$ for all x, y.

13 Three fair coins are tossed. Let X denote the number of heads on the first two
coins, and let Y denote the number of tails on the last two coins.

(a) Find the joint distribution of X and Y.

(b) Find the conditional distribution of Y given that X = 1.

(c) Find cov [X, Y].

14 Let random variable X have a density function $f(\cdot)$, cumulative distribution
function $F(\cdot)$, mean $\mu$, and variance $\sigma^2$. Define $Y = \alpha + \beta X$, where $\alpha$ and $\beta$ are
constants satisfying $-\infty < \alpha < \infty$ and $\beta > 0$.

(a) Select $\alpha$ and $\beta$ so that Y has mean 0 and variance 1.
(b) What is the correlation coefficient between X and Y?

(c) Find the cumulative distribution function of Y in terms of $\alpha$, $\beta$, and $F(\cdot)$.

(d) If X is symmetrically distributed about $\mu$, is Y necessarily symmetrically
distributed about its mean? (HINT: Z is symmetrically distributed about
constant C if $Z - C$ and $-(Z - C)$ have the same distribution.)

15 Suppose that random variable X is uniformly distributed over the interval (0, 1);
that is, $f_X(x) = I_{(0,1)}(x)$. Assume that the conditional distribution of Y given
X = x has a binomial distribution with parameters n and p = x; i.e.,

$P[Y = y \mid X = x] = \binom{n}{y} x^y (1 - x)^{n-y}$ for y = 0, 1, ..., n.

(a) Find $\mathcal{E}[Y]$.

(b) Find the distribution of Y.

16 Suppose that the joint probability density function of (X, Y) is given by

$f_{X,Y}(x, y) = [1 - \alpha(1 - 2x)(1 - 2y)] I_{(0,1)}(x) I_{(0,1)}(y)$,

where the parameter $\alpha$ satisfies $-1 \le \alpha \le 1$.

(a) Prove or disprove: X and Y are independent if and only if X and Y are un-
correlated.



An isosceles triangle is formed as indicated in the sketch. 

(b) If (X, Y) has the joint density given above, pick $\alpha$ to maximize the expected
area of the triangle.

(c) What is the probability that the triangle falls within the unit square with
corners at (0, 0), (1, 0), (1, 1), and (0, 1)?

*(d) Find the expected length of the perimeter of the triangle.

17 Consider tossing two tetrahedra with sides numbered 1 to 4. Let $Y_1$ denote the
smaller of the two downturned numbers and $Y_2$ the larger.

(a) Find the joint density function of $Y_1$ and $Y_2$.

(b) Find $P[Y_1 > 2, Y_2 > 2]$.

(c) Find the mean and variance of $Y_1$ and $Y_2$.

(d) Find the conditional distribution of $Y_2$ given $Y_1$ for each of the possible
values of $Y_1$.

(e) Find the correlation coefficient of $Y_1$ and $Y_2$.

18 Let $f_{X,Y}(x, y) = e^{-(x+y)} I_{(0,\infty)}(x) I_{(0,\infty)}(y)$.

(a) Find $P[X > 1]$. (b) Find $P[1 < X + Y < 2]$.

(c) Find $P[X < Y \mid X < 2Y]$. (d) Find m such that P[X + Y

(e) Find $P[0 < X < 1 \mid Y = 2]$. (f) Find the correlation coefficient of X and Y.

* 19 Let $f_{X,Y}(x, y) = e^{-y}(1 - e^{-x}) I_{(0,y]}(x) I_{[0,\infty)}(y) + e^{-x}(1 - e^{-y}) I_{(0,x]}(y) I_{[0,\infty)}(x)$.

(a) Show that $f_{X,Y}(\cdot, \cdot)$ is a probability density function.

(b) Find the marginal distributions of X and Y.

(c) Find $\mathcal{E}[Y \mid X = x]$ for $0 < x$.

(d) Find $P[X < 2, Y < 2]$.

(e) Find the correlation coefficient of X and Y.

(f) Find another joint probability density function having the same marginals.

* 20 Suppose X and Y are independent and identically distributed random variables
with probability density function $f(\cdot)$ that is symmetrical about 0.

(a) Prove that $P[|X + Y| \le 2|X|] > \tfrac{1}{2}$.

(b) Select some symmetrical probability density function $f(\cdot)$, and evaluate
$P[|X + Y| \le 2|X|]$.

* 21 Prove or disprove: If $\mathcal{E}[Y \mid X] = X$, $\mathcal{E}[X \mid Y] = Y$, and both $\mathcal{E}[X^2]$ and $\mathcal{E}[Y^2]$ are
finite, then $P[X = Y] = 1$. (Possible HINT: $P[X = Y] = 1$ if var [X − Y] = 0.)

22 A multivariate Chebyshev inequality: Let $(X_1, \ldots, X_m)$ be jointly distributed with

$\mathcal{E}[X_j] = \mu_j$ and var $[X_j] = \sigma_j^2$ for j = 1, ..., m. Define $A_j = \{|X_j - \mu_j| \le \sqrt{m}\,t\sigma_j\}$.

Show that $P\left[\bigcap_{j=1}^{m} A_j\right] \ge 1 - t^{-2}$, for t > 0.

23 Let $f_X(\cdot)$ be a probability density function with corresponding cumulative dis-
tribution function $F_X(\cdot)$. In terms of $f_X(\cdot)$ and/or $F_X(\cdot)$:

(a) Find $P[X > x_0 + \Delta x \mid X > x_0]$.

(b) Find $P[x_0 < X < x_0 + \Delta x \mid X > x_0]$.

(c) Find the limit of the above divided by $\Delta x$ as $\Delta x$ goes to 0.

(d) Evaluate the quantities in parts (a) to (c) for $f_X(x) = \lambda e^{-\lambda x} I_{(0,\infty)}(x)$.

24 Let N equal the number of times a certain device may be used before it breaks.
The probability is p that it will break on any one try given that it did not break
on any of the previous tries.

(a) Express this in terms of conditional probabilities.

(b) Express it in terms of a density function, and find the density function.

25 Player A tosses a coin with sides numbered 1 and 2. B spins a spinner evenly
graduated from 0 to 3. B's spinner is fair, but A's coin is not; it comes up 1
with a probability p, not necessarily equal to $\tfrac{1}{2}$. The payoff X of this game is the
difference in their numbers (A's number minus B's). Find the cumulative dis-
tribution function of X.

26 An urn contains four balls; two of the balls are numbered with a 1, and the other
two are numbered with a 2. Two balls are drawn from the urn without replace-
ment. Let X denote the smaller of the numbers on the drawn balls and Y the
larger.

(a) Find the joint density of X and Y.

(b) Find the marginal distribution of Y.

(c) Find the cov [X, Y].

27 The joint probability density function of X and Y is given by

$f_{X,Y}(x, y) = 3(x + y) I_{(0,1)}(x + y) I_{(0,1)}(x) I_{(0,1)}(y)$.

(Note the symmetry in x and y.)

(a) Find the marginal density of X.

(b) Find $P[X + Y < .5]$.

(c) Find $\mathcal{E}[Y \mid X = x]$.

(d) Find cov [X, Y].

28 The discrete density of X is given by $f_X(x) = x/3$ for x = 1, 2, and $f_{Y|X}(y|x)$ is
binomial with parameters x and $\tfrac{1}{2}$; that is,

$f_{Y|X}(y|x) = P[Y = y \mid X = x] = \binom{x}{y}\left(\tfrac{1}{2}\right)^x$

for y = 0, ..., x and x = 1, 2.

(a) Find $\mathcal{E}[X]$ and var [X].

(b) Find $\mathcal{E}[Y]$.

(c) Find the joint distribution of X and Y.

29 Let the joint density function of X and Y be given by $f_{X,Y}(x, y) = 8xy$ for
$0 < x < y < 1$ and be 0 elsewhere.

(a) Find $\mathcal{E}[Y \mid X = x]$.

(b) Find $\mathcal{E}[XY \mid X = x]$.

(c) Find var $[Y \mid X = x]$.

30 Let Y be a random variable having a Poisson distribution with parameter $\lambda$.
Assume that the conditional distribution of X given Y = y is binomially distrib-
uted with parameters y and p. Find the distribution of X, if X = 0 when Y = 0.

31 Assume that X and Y are independent random variables and X (Y) has binomial
distribution with parameters 3 and J (2 and j). Find $P[X = Y]$.

32 Let X and Y have bivariate normal distribution with parameters $\mu_X = 5$, $\mu_Y = 10$,
$\sigma_X^2 = 1$, and $\sigma_Y^2 = 25$.

(a) If $\rho > 0$, find $\rho$ when $P[4 < Y < 16 \mid X = 5] = .954$.

*(b) If $\rho = 0$, find $P[X + Y \le 16]$.

33 Two dice are cast 10 times. Let X be the number of times no 1s appear, and let
Y be the number of times two 1s appear.

(a) What is the probability that X and Y will each be less than 3?

(b) What is the probability that X + Y will be 4?

34 Three coins are tossed n times.

(a) Find the joint density of X, the number of times no heads appear; Y, the num-
ber of times one head appears; and Z, the number of times two heads appear.

(b) Find the conditional density of X and Z given Y.

35 Six cards are drawn without replacement from an ordinary deck.

(a) Find the joint density of the number of aces X and the number of kings Y.
(b) Find the conditional density of X given Y.

36 Let the two-dimensional random variable (X, Y) have the joint density

$f_{X,Y}(x, y) = \tfrac{1}{8}(6 - x - y) I_{(0,2)}(x) I_{(2,4)}(y)$.

(a) Find $\mathcal{E}[Y \mid X = x]$. (b) Find $\mathcal{E}[Y^2 \mid X = x]$.

(c) Find var $[Y \mid X = x]$. (d) Show that $\mathcal{E}[Y] = \mathcal{E}[\mathcal{E}[Y \mid X]]$.

(e) Find $\mathcal{E}[XY \mid X = x]$.

37 The trinomial distribution (multinomial with k + 1 = 3) of two random variables
X and Y is given by

$f_{X,Y}(x, y) = \dfrac{n!}{x!\,y!\,(n - x - y)!}\, p^x q^y (1 - p - q)^{n-x-y}$

for x, y = 0, 1, ..., n and $x + y \le n$, where $0 \le p$, $0 \le q$, and $p + q \le 1$.

(a) Find the marginal distribution of Y.

(b) Find the conditional distribution of X given Y, and obtain its expected
value.

(c) Find $\rho[X, Y]$.

38 Let (X, Y) have probability density function $f_{X,Y}(x, y)$, and let u(X) and v(Y) be
functions of X and Y, respectively. Show that

$\mathcal{E}[u(X)v(Y) \mid X = x] = u(x)\,\mathcal{E}[v(Y) \mid X = x]$.

39 If X and Y are two random variables and $\mathcal{E}[Y \mid X = x] = \mu$, where $\mu$ does not
depend on x, show that var [Y] = $\mathcal{E}$[var [Y | X]].

40 If X and Y are two independent random variables, does $\mathcal{E}[Y \mid X = x]$ depend
on x?

41 If the joint moment generating function of (X, Y) is given by $m_{X,Y}(t_1, t_2) =
\exp[\tfrac{1}{2}(t_1^2 + t_2^2)]$, what is the distribution of Y?

42 Define the moment generating function of $Y \mid X = x$. Does $m_Y(t) = \mathcal{E}[m_{Y|X}(t)]$?

43 Toss three coins. Let X denote the number of heads on the first two and Y
denote the number of heads on the last two.

(a) Find the joint distribution of X and Y.

(b) Find $\mathcal{E}[Y \mid X = 1]$.

(c) Find $\rho_{X,Y}$.

(d) Give a joint distribution that is not the joint distribution given in part (a)
yet has the same marginal distributions as the joint distribution given in
part (a).

44 Suppose that X and Y are jointly continuous random variables, $f_{Y|X}(y|x) =
I_{(x,x+1)}(y)$, and $f_X(x) = I_{(0,1)}(x)$.

(a) Find $\mathcal{E}[Y]$. (b) Find cov [X, Y].

(c) Find $P[X + Y < 1]$. (d) Find $f_{X|Y}(x|y)$.

45 Let (X, Y) have a joint discrete density function

$f_{X,Y}(x, y) = p_1^x(1 - p_1)^{1-x}\, p_2^y(1 - p_2)^{1-y}\,[1 + \alpha(x - p_1)(y - p_2)]\, I_{\{0,1\}}(x) I_{\{0,1\}}(y)$,

where $0 < p_1 < 1$, $0 < p_2 < 1$, and $-1 \le \alpha \le 1$. Prove or disprove: X and Y
are independent if and only if they are uncorrelated.

*46 Let (X, Y) be jointly discrete random variables such that each X and Y have at
most two mass points. Prove or disprove: X and Y are independent if and only
if they are uncorrelated.

DISTRIBUTIONS OF FUNCTIONS OF RANDOM VARIABLES


1 INTRODUCTION AND SUMMARY

As the title of this chapter indicates, we are interested in finding the distribu-
tions of functions of random variables. More precisely, for given random
variables, say $X_1, X_2, \ldots, X_n$, and given functions of the n given random variables,
say $g_1(\cdot, \ldots, \cdot), \ldots, g_k(\cdot, \ldots, \cdot)$, we want, in general, to find the
joint distribution of $Y_1, Y_2, \ldots, Y_k$, where $Y_j = g_j(X_1, \ldots, X_n)$, j = 1, 2, ..., k.
If the joint density of the random variables $X_1, X_2, \ldots, X_n$ is given, then theo-
retically at least, we can find the joint distribution of $Y_1, Y_2, \ldots, Y_k$. This
follows since the joint cumulative distribution function of $Y_1, \ldots, Y_k$ satisfies
the following:

\[
\begin{aligned}
F_{Y_1, \ldots, Y_k}(y_1, \ldots, y_k) &= P[Y_1 \le y_1; \ldots; Y_k \le y_k] \\
&= P[g_1(X_1, \ldots, X_n) \le y_1; \ldots; g_k(X_1, \ldots, X_n) \le y_k]
\end{aligned}
\]

for fixed $y_1, \ldots, y_k$, which is the probability of an event described in terms of
$X_1, \ldots, X_n$, and theoretically such a probability can be determined by integrat-
ing or summing the joint density over the region corresponding to the event.
The problem is that in general one cannot easily evaluate the desired probability
for each $y_1, \ldots, y_k$. One of the important problems of statistical inference, the
estimation of parameters, provides us with an example of a problem in which
it is useful to be able to find the distribution of a function of joint random
variables.

In this chapter three techniques for finding the distribution of functions of
random variables will be presented. These three techniques are called (i) the
cumulative-distribution-function technique, alluded to above and discussed in
Sec. 3, (ii) the moment-generating-function technique, considered in Sec. 4, and
(iii) the transformation technique, considered in Secs. 5 and 6. A number of
important examples are given, including the distribution of sums of independent
random variables (in Subsec. 4.2) and the distribution of the minimum and
maximum (in Subsec. 3.2). Presentation of other important derived distributions
is deferred until later chapters. For instance, the distributions of chi-square,
Student's t, and F, all derived from sampling from a normal distribution, are
given in Sec. 4 of the next chapter.

Preceding the presentation of the techniques for finding the distribution
of functions of random variables is a discussion, given in Sec. 2, of expectations
of functions of random variables. As one might suspect, an expectation, for
example, the mean or the variance, of a function of given random variables can
sometimes be expressed in terms of expectations of the given random variables.
If such is the case and one is only interested in certain expectations, then it is not
necessary to solve the problem of finding the distribution of the function of the
given random variables. One important function of given random variables
is their sum, and in Subsec. 2.2 the mean and variance of a sum of given random
variables are derived.

We have remarked several times in past chapters that our intermediate
objective was the understanding of distribution theory. This chapter provides
us with a presentation of distribution theory at a level that is deemed adequate
for the understanding of the statistical concepts that are given in the remainder
of this book.


2 EXPECTATIONS OF FUNCTIONS OF RANDOM VARIABLES

2.1 Expectation Two Ways

An expectation of a function of a set of random variables can be obtained two
different ways. To illustrate, consider a function of just one random variable,
say X. Let $g(\cdot)$ be the function, and set Y = g(X). Since Y is a random
variable, $\mathcal{E}[Y]$ is defined (if it exists), and $\mathcal{E}[g(X)]$ is defined (if it exists). For
instance, if X and Y = g(X) are continuous random variables, then by definition

\[ \mathcal{E}[Y] = \int_{-\infty}^{\infty} y f_Y(y)\,dy \tag{1} \]

and

\[ \mathcal{E}[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx; \tag{2} \]

but Y = g(X), so it seems reasonable that $\mathcal{E}[Y] = \mathcal{E}[g(X)]$. This can, in fact,
be proved, although we will not bother to do it. Thus we have two ways of
calculating the expectation of Y = g(X); one is to average Y with respect to the
density of Y, and the other is to average g(X) with respect to the density of X.

In general, for given random variables $X_1, \ldots, X_n$, let $Y = g(X_1, \ldots, X_n)$;
then $\mathcal{E}[Y] = \mathcal{E}[g(X_1, \ldots, X_n)]$, where (for jointly continuous random variables)

\[ \mathcal{E}[Y] = \int_{-\infty}^{\infty} y f_Y(y)\,dy \tag{3} \]

and

\[
\mathcal{E}[g(X_1, \ldots, X_n)] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty}
g(x_1, \ldots, x_n) f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n. \tag{4}
\]

In practice, one would naturally select that method which makes the
calculations easier. One might suspect that Eq. (3) gives the better method of
the two since it involves only a single integral whereas Eq. (4) involves a multiple
integral. On the other hand, Eq. (3) involves the density of Y, a density that
may have to be obtained before integration can proceed.


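EXAMPLE 1 Let X be a standard normal random variable, and let $g(x) = x^2$.
For $Y = g(X) = X^2$,

\[ \mathcal{E}[Y] = \int_{-\infty}^{\infty} y f_Y(y)\,dy \]

and

\[ \mathcal{E}[g(X)] = \mathcal{E}[X^2] = \int_{-\infty}^{\infty} x^2 f_X(x)\,dx. \]

Now

\[ \mathcal{E}[X^2] = \int_{-\infty}^{\infty} x^2\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\,dx = 1 \]

and

\[
\mathcal{E}[Y] = \int_{0}^{\infty} y\, \frac{1}{\Gamma(\tfrac{1}{2})} \left(\tfrac{1}{2}\right)^{1/2} y^{-1/2} e^{-y/2}\,dy = 1,
\]

using the fact that Y has a gamma distribution with parameters $r = \tfrac{1}{2}$ and
$\lambda = \tfrac{1}{2}$. (See Example 2 in Subsec. 3.1 below.) ////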
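
The two routes to the same expectation in Example 1 can be compared numerically. The sketch below is an illustration under stated assumptions: the integration limits, grid sizes, and helper names are choices made here, and the gamma density with $r = \tfrac{1}{2}$, $\lambda = \tfrac{1}{2}$ is written out directly from its usual formula.

```python
import math

def midpoint_integral(func, lower, upper, n):
    """Midpoint-rule approximation of the integral of func over (lower, upper)."""
    h = (upper - lower) / n
    return h * sum(func(lower + (k + 0.5) * h) for k in range(n))

def standard_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gamma_half_half_pdf(y):
    """Gamma density with r = 1/2 and lambda = 1/2 (the density of Y = X^2), y > 0."""
    return math.sqrt(0.5) / math.gamma(0.5) * y ** -0.5 * math.exp(-0.5 * y)

# Eq. (2): average g(X) = X^2 with respect to the density of X.
via_density_of_x = midpoint_integral(
    lambda x: x * x * standard_normal_pdf(x), -12.0, 12.0, 48_000)

# Eq. (1): average Y with respect to the density of Y.
via_density_of_y = midpoint_integral(
    lambda y: y * gamma_half_half_pdf(y), 0.0, 80.0, 160_000)

print(via_density_of_x, via_density_of_y)
```

Both numbers should be close to 1. The second route needed the density of Y before any integration could start, which is the trade-off described in the paragraph preceding the example.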

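2.2 Sums of Random Variables

A simple, yet important, function of several random variables is their sum.

Theorem 1 For random variables $X_1, \ldots, X_n$,

\[ \mathcal{E}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathcal{E}[X_i] \tag{5} \]

and

\[
\operatorname{var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \operatorname{var}[X_i] + 2\mathop{\sum\sum}_{i<j} \operatorname{cov}[X_i, X_j]. \tag{6}
\]

PROOF That $\mathcal{E}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathcal{E}[X_i]$ follows from a property of expec-
tation (see the last Remark in Subsec. 4.1 of Chap. IV).

\[
\begin{aligned}
\operatorname{var}\left[\sum_{i=1}^{n} X_i\right]
&= \mathcal{E}\left[\left(\sum_{i=1}^{n} X_i - \mathcal{E}\left[\sum_{i=1}^{n} X_i\right]\right)^2\right]
 = \mathcal{E}\left[\left(\sum_{i=1}^{n} (X_i - \mathcal{E}[X_i])\right)^2\right] \\
&= \mathcal{E}\left[\sum_{i=1}^{n}\sum_{j=1}^{n} (X_i - \mathcal{E}[X_i])(X_j - \mathcal{E}[X_j])\right] \\
&= \sum_{i=1}^{n}\sum_{j=1}^{n} \mathcal{E}[(X_i - \mathcal{E}[X_i])(X_j - \mathcal{E}[X_j])] \\
&= \sum_{i=1}^{n} \operatorname{var}[X_i] + 2\mathop{\sum\sum}_{i<j} \operatorname{cov}[X_i, X_j].
\end{aligned}
\]
////

Corollary If $X_1, \ldots, X_n$ are uncorrelated random variables, then

\[ \operatorname{var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \operatorname{var}[X_i]. \]
////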

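The following theorem gives a result that is somewhat related to the above
theorem inasmuch as its proof, which is left as an exercise, is similar.

Theorem 2 Let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ be two sets of random vari-
ables, and let $a_1, \ldots, a_n$ and $b_1, \ldots, b_m$ be two sets of constants; then

\[
\operatorname{cov}\left[\sum_{i=1}^{n} a_i X_i, \sum_{j=1}^{m} b_j Y_j\right]
= \sum_{i=1}^{n}\sum_{j=1}^{m} a_i b_j \operatorname{cov}[X_i, Y_j]. \tag{7}
\]
////

Corollary If $X_1, \ldots, X_n$ are random variables and $a_1, \ldots, a_n$ are
constants, then

\[
\begin{aligned}
\operatorname{var}\left[\sum_{i=1}^{n} a_i X_i\right]
&= \sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j \operatorname{cov}[X_i, X_j] \\
&= \sum_{i=1}^{n} a_i^2 \operatorname{var}[X_i] + \mathop{\sum\sum}_{i \ne j} a_i a_j \operatorname{cov}[X_i, X_j].
\end{aligned} \tag{8}
\]

In particular, if $X_1, \ldots, X_n$ are independent and identically distributed
random variables with mean $\mu_X$ and variance $\sigma_X^2$ and if $\bar{X}_n = (1/n)\sum_{i=1}^{n} X_i$,
then

\[ \mathcal{E}[\bar{X}_n] = \mu_X \quad \text{and} \quad \operatorname{var}[\bar{X}_n] = \frac{\sigma_X^2}{n}. \tag{9} \]

PROOF Let m = n, $Y_i = X_i$, and $b_i = a_i$, i = 1, ..., n in the above
theorem; then

\[
\operatorname{var}\left[\sum_{i=1}^{n} a_i X_i\right] = \operatorname{cov}\left[\sum_{i=1}^{n} a_i X_i, \sum_{j=1}^{n} b_j Y_j\right],
\]

and Eq. (8) follows from Eq. (7). To obtain the variance part of Eq. (9)
from Eq. (8), set $a_i = 1/n$ and $\sigma_X^2 = \operatorname{var}[X_i]$. The mean part of Eq. (9)
is routinely derived as

\[
\mathcal{E}[\bar{X}_n] = \mathcal{E}\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} \mathcal{E}[X_i] = \frac{1}{n}\sum_{i=1}^{n} \mu_X = \mu_X.
\]
////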
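
Equation (8) is an algebraic identity, so it also holds exactly when every variance and covariance is replaced by its empirical counterpart computed from one data set with the same divisor. The sketch below uses that fact as a check; the simulated data, the seed, the coefficients `a`, and the helper `sample_cov` are all assumptions of this illustration.

```python
import random

def sample_cov(xs, ys):
    """Empirical covariance with divisor n (the covariance of the empirical distribution)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

random.seed(0)
n_obs = 2000
z = [random.gauss(0.0, 1.0) for _ in range(n_obs)]

# Three deliberately correlated columns built from a shared component z.
x1 = [zi + random.gauss(0.0, 1.0) for zi in z]
x2 = [2.0 * zi + random.uniform(-1.0, 1.0) for zi in z]
x3 = [random.expovariate(1.0) - zi for zi in z]
columns = [x1, x2, x3]
a = [0.5, -1.0, 2.0]

# Left side of Eq. (8): variance of the linear combination.
combo = [sum(ai * col[k] for ai, col in zip(a, columns)) for k in range(n_obs)]
lhs = sample_cov(combo, combo)

# Right side of Eq. (8): double sum of a_i a_j cov[X_i, X_j].
rhs = sum(a[i] * a[j] * sample_cov(columns[i], columns[j])
          for i in range(3) for j in range(3))

print(lhs, rhs)
```

The two printed numbers agree up to rounding error, whatever data are used, because the same bilinear expansion that proves Theorem 2 applies to the empirical moments.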

Corollary If $X_1$ and $X_2$ are two random variables, then

\[ \operatorname{var}[X_1 \pm X_2] = \operatorname{var}[X_1] + \operatorname{var}[X_2] \pm 2\operatorname{cov}[X_1, X_2]. \tag{10} \]
////

Equation (10) gives the variance of the sum or the difference of two ran-
dom variables. Clearly

\[ \mathcal{E}[X_1 \pm X_2] = \mathcal{E}[X_1] \pm \mathcal{E}[X_2]. \tag{11} \]

2.3 Product and Quotient

In the above subsection the mean and variance of the sum and difference of two
random variables were obtained. It was found that the mean and variance of
the sum or difference of random variables X and Y could be expressed in terms
of the means, variances, and covariance of X and Y. We consider now the
problem of finding the first two moments of the product and quotient of X and Y.

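Theorem 3 Let X and Y be two random variables for which var [XY]
exists; then

\[ \mathcal{E}[XY] = \mu_X\mu_Y + \operatorname{cov}[X, Y], \tag{12} \]

and

\[
\begin{aligned}
\operatorname{var}[XY] ={}& \mu_Y^2 \operatorname{var}[X] + \mu_X^2 \operatorname{var}[Y] + 2\mu_X\mu_Y \operatorname{cov}[X, Y] \\
& - (\operatorname{cov}[X, Y])^2 + \mathcal{E}[(X - \mu_X)^2(Y - \mu_Y)^2] \\
& + 2\mu_Y \mathcal{E}[(X - \mu_X)^2(Y - \mu_Y)] + 2\mu_X \mathcal{E}[(X - \mu_X)(Y - \mu_Y)^2].
\end{aligned} \tag{13}
\]

PROOF

\[ XY = \mu_X\mu_Y + (X - \mu_X)\mu_Y + (Y - \mu_Y)\mu_X + (X - \mu_X)(Y - \mu_Y). \]

Calculate $\mathcal{E}[XY]$ and $\mathcal{E}[(XY)^2]$ to get the desired results. ////

Corollary If X and Y are independent, $\mathcal{E}[XY] = \mu_X\mu_Y$, and
$\operatorname{var}[XY] = \mu_Y^2 \operatorname{var}[X] + \mu_X^2 \operatorname{var}[Y] + \operatorname{var}[X]\operatorname{var}[Y]$.

PROOF If X and Y are independent,

\[
\begin{aligned}
\mathcal{E}[(X - \mu_X)^2(Y - \mu_Y)^2] &= \mathcal{E}[(X - \mu_X)^2]\,\mathcal{E}[(Y - \mu_Y)^2] = \operatorname{var}[X]\operatorname{var}[Y], \\
\mathcal{E}[(X - \mu_X)^2(Y - \mu_Y)] &= \mathcal{E}[(X - \mu_X)^2]\,\mathcal{E}[Y - \mu_Y] = 0,
\end{aligned}
\]

and

\[ \mathcal{E}[(X - \mu_X)(Y - \mu_Y)^2] = 0. \]
////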

Note that the mean of the product can be expressed in terms of the means
and covariance of X and Y but the variance of the product requires higher-order
moments.
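
For independent X and Y the corollary to Theorem 3 can be verified exactly on small discrete distributions by building the distribution of the product and comparing moments. The sketch below uses exact rational arithmetic; the two probability mass functions are arbitrary illustrative choices, not taken from the text.

```python
from fractions import Fraction
from itertools import product

# Hypothetical independent discrete distributions chosen for illustration.
x_pmf = {x: Fraction(1, 6) for x in range(1, 7)}                  # a fair die
y_pmf = {0: Fraction(1, 2), 1: Fraction(1, 3), 4: Fraction(1, 6)}

def mean(pmf):
    return sum(value * prob for value, prob in pmf.items())

def variance(pmf):
    m = mean(pmf)
    return sum((value - m) ** 2 * prob for value, prob in pmf.items())

# Distribution of the product XY; independence makes joint probabilities multiply.
product_pmf = {}
for (x, px), (y, py) in product(x_pmf.items(), y_pmf.items()):
    product_pmf[x * y] = product_pmf.get(x * y, Fraction(0)) + px * py

mu_x, mu_y = mean(x_pmf), mean(y_pmf)
var_x, var_y = variance(x_pmf), variance(y_pmf)

lhs_mean = mean(product_pmf)
rhs_mean = mu_x * mu_y
lhs_var = variance(product_pmf)
rhs_var = mu_y ** 2 * var_x + mu_x ** 2 * var_y + var_x * var_y

print(lhs_mean, rhs_mean)
print(lhs_var, rhs_var)
```

Because fractions are exact, the two sides match exactly rather than approximately. For dependent variables the extra terms of Eq. (13) would be needed.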

In general, there are no simple exact formulas for the mean and variance
of the quotient of two random variables in terms of moments of the two random
variables; however, there are approximate formulas which are sometimes useful.

Theorem 4

\[
\mathcal{E}\left[\frac{X}{Y}\right] \approx \frac{\mu_X}{\mu_Y} - \frac{1}{\mu_Y^2}\operatorname{cov}[X, Y] + \frac{\mu_X}{\mu_Y^3}\operatorname{var}[Y], \tag{14}
\]

and

\[
\operatorname{var}\left[\frac{X}{Y}\right] \approx \left(\frac{\mu_X}{\mu_Y}\right)^2
\left(\frac{\operatorname{var}[X]}{\mu_X^2} + \frac{\operatorname{var}[Y]}{\mu_Y^2} - \frac{2\operatorname{cov}[X, Y]}{\mu_X\mu_Y}\right). \tag{15}
\]

PROOF To find the approximate formula for $\mathcal{E}[X/Y]$, consider the
Taylor series expansion of x/y expanded about $(\mu_X, \mu_Y)$; drop all terms of
order higher than 2, and then take the expectation of both sides. The
approximate formula for var [X/Y] is similarly obtained by expanding in
a Taylor series and retaining only second-order terms. ////


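Two comments are in order: First, it is not unusual that the mean and
variance of the quotient X/Y do not exist even though the moments of X and Y
do exist. (See Examples 5, 23, and 24.) Second, the method of proof of
Theorem 4 can be used to find approximate formulas for the mean and variance
of functions of X and Y other than the quotient. For example,

\[
\begin{aligned}
\mathcal{E}[g(X, Y)] \approx{}& g(\mu_X, \mu_Y)
+ \frac{1}{2}\operatorname{var}[X] \left.\frac{\partial^2}{\partial x^2} g(x, y)\right|_{\mu_X, \mu_Y}
+ \frac{1}{2}\operatorname{var}[Y] \left.\frac{\partial^2}{\partial y^2} g(x, y)\right|_{\mu_X, \mu_Y} \\
& + \operatorname{cov}[X, Y] \left.\frac{\partial^2}{\partial y\,\partial x} g(x, y)\right|_{\mu_X, \mu_Y}
\end{aligned} \tag{16}
\]

and

\[
\begin{aligned}
\operatorname{var}[g(X, Y)] \approx{}& \operatorname{var}[X]\left\{\left.\frac{\partial}{\partial x} g(x, y)\right|_{\mu_X, \mu_Y}\right\}^2
+ \operatorname{var}[Y]\left\{\left.\frac{\partial}{\partial y} g(x, y)\right|_{\mu_X, \mu_Y}\right\}^2 \\
& + 2\operatorname{cov}[X, Y]\left\{\left.\frac{\partial}{\partial x} g(x, y) \cdot \frac{\partial}{\partial y} g(x, y)\right|_{\mu_X, \mu_Y}\right\}.
\end{aligned} \tag{17}
\]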
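
One way to see that Eqs. (14) and (15) are the special case g(x, y) = x/y of Eqs. (16) and (17) is to evaluate the general formulas with numerically approximated partial derivatives and compare. The following sketch does that; the moment values, step sizes, and function names are assumptions made for illustration, and the finite differences are only approximations to the derivatives.

```python
def ratio_mean_approx(mu_x, mu_y, var_x, var_y, cov_xy):
    """Eq. (14): approximate mean of X/Y."""
    return mu_x / mu_y - cov_xy / mu_y ** 2 + mu_x * var_y / mu_y ** 3

def ratio_var_approx(mu_x, mu_y, var_x, var_y, cov_xy):
    """Eq. (15): approximate variance of X/Y."""
    return (mu_x / mu_y) ** 2 * (
        var_x / mu_x ** 2 + var_y / mu_y ** 2 - 2.0 * cov_xy / (mu_x * mu_y))

def general_mean_approx(g, mu_x, mu_y, var_x, var_y, cov_xy, h=1e-3):
    """Eq. (16) with second partial derivatives replaced by central differences."""
    g0 = g(mu_x, mu_y)
    g_xx = (g(mu_x + h, mu_y) - 2.0 * g0 + g(mu_x - h, mu_y)) / h ** 2
    g_yy = (g(mu_x, mu_y + h) - 2.0 * g0 + g(mu_x, mu_y - h)) / h ** 2
    g_xy = (g(mu_x + h, mu_y + h) - g(mu_x + h, mu_y - h)
            - g(mu_x - h, mu_y + h) + g(mu_x - h, mu_y - h)) / (4.0 * h ** 2)
    return g0 + 0.5 * var_x * g_xx + 0.5 * var_y * g_yy + cov_xy * g_xy

def general_var_approx(g, mu_x, mu_y, var_x, var_y, cov_xy, h=1e-5):
    """Eq. (17) with first partial derivatives replaced by central differences."""
    g_x = (g(mu_x + h, mu_y) - g(mu_x - h, mu_y)) / (2.0 * h)
    g_y = (g(mu_x, mu_y + h) - g(mu_x, mu_y - h)) / (2.0 * h)
    return var_x * g_x ** 2 + var_y * g_y ** 2 + 2.0 * cov_xy * g_x * g_y

# Illustrative moments: (mu_X, mu_Y, var[X], var[Y], cov[X, Y]).
moments = (2.0, 5.0, 0.3, 0.4, 0.1)

def quotient(x, y):
    return x / y

mean_14 = ratio_mean_approx(*moments)
var_15 = ratio_var_approx(*moments)
mean_16 = general_mean_approx(quotient, *moments)
var_17 = general_var_approx(quotient, *moments)

print(mean_14, mean_16)
print(var_15, var_17)
```

The pairs agree up to finite-difference error. Passing a different `g` gives the corresponding approximations for other functions of X and Y, subject to the same caveat in the text that the exact mean or variance may not even exist.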


3 CUMULATIVE-DISTRIBUTION-FUNCTION TECHNIQUE

3.1 Description of Technique

If the joint distribution of random variables $X_1, \ldots, X_n$ is given, then, theoreti-
cally, the joint distribution of random variables $Y_1, \ldots, Y_k$ can be determined,
where $Y_j = g_j(X_1, \ldots, X_n)$, j = 1, ..., k for given functions $g_1(\cdot, \ldots, \cdot), \ldots,
g_k(\cdot, \ldots, \cdot)$. By definition, the joint cumulative distribution function of
$Y_1, \ldots, Y_k$ is $F_{Y_1, \ldots, Y_k}(y_1, \ldots, y_k) = P[Y_1 \le y_1; \ldots; Y_k \le y_k]$. But for each
$y_1, \ldots, y_k$ the event $\{Y_1 \le y_1; \ldots; Y_k \le y_k\} \equiv \{g_1(X_1, \ldots, X_n) \le y_1; \ldots;
g_k(X_1, \ldots, X_n) \le y_k\}$. This latter event is an event described in terms of the
given functions $g_1(\cdot, \ldots, \cdot), \ldots, g_k(\cdot, \ldots, \cdot)$ and the given random variables
$X_1, \ldots, X_n$. Since the joint distribution of $X_1, \ldots, X_n$ is assumed given, presum-
ably the probability of event $\{g_1(X_1, \ldots, X_n) \le y_1; \ldots; g_k(X_1, \ldots, X_n) \le y_k\}$
can be calculated and consequently $F_{Y_1, \ldots, Y_k}(\cdot, \ldots, \cdot)$ determined. The above
described technique for deriving the joint distribution of $Y_1, \ldots, Y_k$ will be
called the cumulative-distribution-function technique.

An important special case arises if k = 1; then there is only one function,
say $g(X_1, \ldots, X_n)$, of the given random variables for which one needs to derive
the distribution.


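EXAMPLE 2 Let there be only one given random variable, say X, which has a
standard normal distribution. Suppose the distribution of $Y = g(X) = X^2$
is desired. For $y > 0$,

$$F_Y(y) = P[Y \le y] = P[X^2 \le y] = P[-\sqrt{y} \le X \le \sqrt{y}] = \Phi(\sqrt{y}) - \Phi(-\sqrt{y})$$

$$= 2\int_0^{\sqrt{y}} \phi(u)\,du = 2\int_0^{\sqrt{y}} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\,du = \int_0^{y} \frac{1}{\Gamma(\frac{1}{2})}\,\frac{1}{\sqrt{2z}}\, e^{-z/2}\,dz,$$

which can be recognized as the cumulative distribution function of a
gamma distribution with parameters $r = \frac{1}{2}$ and $\lambda = \frac{1}{2}$. ////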
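The cumulative-distribution-function computation for the square of a standard normal variable can be checked numerically. The sketch below (function names, seed, and sample size are illustrative assumptions) evaluates $\Phi(\sqrt{y}) - \Phi(-\sqrt{y})$ through the error function and compares it with the empirical proportion of simulated squares not exceeding y.

```python
import math
import random


def std_normal_cdf(x):
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def cdf_of_square(y):
    """F_Y(y) = P[X^2 <= y] = Phi(sqrt(y)) - Phi(-sqrt(y)) for standard normal X."""
    if y <= 0.0:
        return 0.0
    root = math.sqrt(y)
    return std_normal_cdf(root) - std_normal_cdf(-root)


def empirical_cdf_of_square(y, n_draws=100_000, seed=1):
    """Proportion of simulated values of X^2 that do not exceed y."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_draws) if rng.gauss(0.0, 1.0) ** 2 <= y)
    return hits / n_draws


print(cdf_of_square(1.0), empirical_cdf_of_square(1.0))
```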


Other applications of the cumulative-distribution-function technique
expounded above are given in the following three subsections.

3.2 Distribution of Minimum and Maximum

Let $X_1, \ldots, X_n$ be n given random variables. Define $Y_1 = \min[X_1, \ldots, X_n]$
and $Y_n = \max[X_1, \ldots, X_n]$. To be certain to understand the meaning of
$Y_n = \max[X_1, \ldots, X_n]$, recall that each $X_i$ is a function with domain $\Omega$, the
sample space of a random experiment. For each $\omega \in \Omega$, $X_i(\omega)$ is some real
number. Now $Y_n$ is to be a random variable; that is, for each $\omega$, $Y_n(\omega)$ is to be
some real number. As defined, $Y_n(\omega) = \max[X_1(\omega), \ldots, X_n(\omega)]$; that is, for
a given $\omega$, $Y_n(\omega)$ is the largest of the real numbers $X_1(\omega), \ldots, X_n(\omega)$.

The distributions of $Y_1$ and $Y_n$ are desired. $F_{Y_n}(y) = P[Y_n \le y] =
P[X_1 \le y; \ldots; X_n \le y]$ since the largest of the $X_i$'s is less than or equal to y if
and only if all the $X_i$'s are less than or equal to y. Now, if the $X_i$'s are assumed
independent, then

$$P[X_1 \le y; \ldots; X_n \le y] = \prod_{i=1}^{n} P[X_i \le y] = \prod_{i=1}^{n} F_{X_i}(y);$$

so the distribution of $Y_n = \max[X_1, \ldots, X_n]$ can be expressed in terms of the
marginal distributions of $X_1, \ldots, X_n$. If in addition it is assumed that all the
$X_1, \ldots, X_n$ have the same cumulative distribution, say $F_X(\cdot)$, then

$$\prod_{i=1}^{n} F_{X_i}(y) = [F_X(y)]^n.$$

We have proved Theorem 5.


Theorem 5 If $X_1, \ldots, X_n$ are independent random variables and $Y_n =
\max[X_1, \ldots, X_n]$, then

$$F_{Y_n}(y) = \prod_{i=1}^{n} F_{X_i}(y). \tag{18}$$

If $X_1, \ldots, X_n$ are independent and identically distributed with common
cumulative distribution function $F_X(\cdot)$, then

$$F_{Y_n}(y) = [F_X(y)]^n. \tag{19}$$

////

Corollary If $X_1, \ldots, X_n$ are independent identically distributed continuous
random variables with common probability density function $f_X(\cdot)$
and cumulative distribution function $F_X(\cdot)$, then

$$f_{Y_n}(y) = n[F_X(y)]^{n-1} f_X(y). \tag{20}$$

PROOF

$$f_{Y_n}(y) = \frac{d}{dy} F_{Y_n}(y) = n[F_X(y)]^{n-1} f_X(y).$$

////


Similarly,

$$F_{Y_1}(y) = P[Y_1 \le y] = 1 - P[Y_1 > y] = 1 - P[X_1 > y; \ldots; X_n > y]$$

since $Y_1$ is greater than y if and only if every $X_i > y$. And if $X_1, \ldots, X_n$ are
independent, then

$$1 - P[X_1 > y; \ldots; X_n > y] = 1 - \prod_{i=1}^{n} P[X_i > y] = 1 - \prod_{i=1}^{n} [1 - F_{X_i}(y)].$$

If further it is assumed that $X_1, \ldots, X_n$ are identically distributed with common
cumulative distribution function $F_X(\cdot)$, then

$$1 - \prod_{i=1}^{n} [1 - F_{X_i}(y)] = 1 - [1 - F_X(y)]^n,$$

and we have proved Theorem 6.

Theorem 6 If $X_1, \ldots, X_n$ are independent random variables and $Y_1 =
\min[X_1, \ldots, X_n]$, then

$$F_{Y_1}(y) = 1 - \prod_{i=1}^{n} [1 - F_{X_i}(y)]. \tag{21}$$

And if $X_1, \ldots, X_n$ are independent and identically distributed with common
cumulative distribution function $F_X(\cdot)$, then

$$F_{Y_1}(y) = 1 - [1 - F_X(y)]^n. \tag{22}$$

////

Corollary If $X_1, \ldots, X_n$ are independent identically distributed continuous
random variables with common probability density $f_X(\cdot)$ and
cumulative distribution $F_X(\cdot)$, then

$$f_{Y_1}(y) = n[1 - F_X(y)]^{n-1} f_X(y). \tag{23}$$

PROOF

$$f_{Y_1}(y) = \frac{d}{dy} F_{Y_1}(y) = n[1 - F_X(y)]^{n-1} f_X(y).$$

////

EXAMPLE 3 Suppose that the life of a certain light bulb is exponentially
distributed with mean 100 hours. If 10 such light bulbs are installed
simultaneously, what is the distribution of the life of the light bulb that
fails first, and what is its expected life? Let $X_i$ denote the life of the ith
light bulb; then $Y_1 = \min[X_1, \ldots, X_{10}]$ is the life of the light bulb that
fails first. Assume that the $X_i$'s are independent.
Now $f_{X_i}(x) = \frac{1}{100} e^{-x/100} I_{(0,\infty)}(x)$, and
$F_{X_i}(x) = (1 - e^{-x/100}) I_{(0,\infty)}(x)$; so

$$f_{Y_1}(y) = 10\left(e^{-y/100}\right)^{10-1}\left(\tfrac{1}{100}\, e^{-y/100}\right) I_{(0,\infty)}(y) = \tfrac{1}{10}\, e^{-y/10}\, I_{(0,\infty)}(y),$$

which is an exponential distribution with parameter $\lambda = \frac{1}{10}$; hence $\mathcal{E}[Y_1] =
1/\lambda = 10$. ////
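The light-bulb calculation can be illustrated by simulation. In the sketch below the helper names, the seed, and the number of simulated runs are assumptions for illustration only; the cumulative distribution of the minimum follows the form $1 - [1 - F_X(y)]^n$ stated for independent, identically distributed variables.

```python
import math
import random


def cdf_of_minimum(common_cdf, n, y):
    """F_{Y_1}(y) = 1 - [1 - F_X(y)]^n for i.i.d. X_1, ..., X_n."""
    return 1.0 - (1.0 - common_cdf(y)) ** n


def exponential_cdf(mean):
    """Return the cumulative distribution function of an exponential with the given mean."""
    return lambda x: 1.0 - math.exp(-x / mean) if x > 0 else 0.0


def simulate_first_failure(n_bulbs=10, mean_life=100.0, n_runs=20_000, seed=2):
    """Average simulated life of the first bulb to fail among n_bulbs independent bulbs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        total += min(rng.expovariate(1.0 / mean_life) for _ in range(n_bulbs))
    return total / n_runs


print(cdf_of_minimum(exponential_cdf(100.0), 10, 10.0))  # compare with 1 - exp(-1)
print(simulate_first_failure())                          # should be close to 10 hours
```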


3.3 Distribution of Sum and Difference of Two Random Variables 

Theorem 7 Let X and Y be jointly distributed continuous random
variables with density $f_{X,Y}(x, y)$, and let $Z = X + Y$ and $V = X - Y$.
Then,

$$f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z - x)\,dx = \int_{-\infty}^{\infty} f_{X,Y}(z - y, y)\,dy, \tag{24}$$

and

$$f_V(v) = \int_{-\infty}^{\infty} f_{X,Y}(x, x - v)\,dx = \int_{-\infty}^{\infty} f_{X,Y}(v + y, y)\,dy. \tag{25}$$


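PROOF We will prove only the first part of Eq. (24); the others are
proved in an analogous manner.

$$F_Z(z) = P[Z \le z] = P[X + Y \le z] = \iint_{x + y \le z} f_{X,Y}(x, y)\,dx\,dy$$

$$= \int_{-\infty}^{\infty}\left[\int_{-\infty}^{z - x} f_{X,Y}(x, y)\,dy\right] dx = \int_{-\infty}^{\infty}\left[\int_{-\infty}^{z} f_{X,Y}(x, u - x)\,du\right] dx$$

by making the substitution $y = u - x$.
Now

$$f_Z(z) = \frac{dF_Z(z)}{dz} = \frac{d}{dz}\left\{\int_{-\infty}^{z}\left[\int_{-\infty}^{\infty} f_{X,Y}(x, u - x)\,dx\right] du\right\} = \int_{-\infty}^{\infty} f_{X,Y}(x, z - x)\,dx.$$

////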
Corollary If X and Y are independent continuous random variables
and $Z = X + Y$, then

$$f_Z(z) = f_{X+Y}(z) = \int_{-\infty}^{\infty} f_Y(z - x) f_X(x)\,dx = \int_{-\infty}^{\infty} f_X(z - y) f_Y(y)\,dy. \tag{26}$$

PROOF Equation (26) follows immediately from independence and
Eq. (24); however, we will give a direct proof using a conditional distribution
formula. [See Eq. (11) of Chap. IV.]

$$P[Z \le z] = P[X + Y \le z] = \int_{-\infty}^{\infty} P[X + Y \le z \mid X = x] f_X(x)\,dx$$

$$= \int_{-\infty}^{\infty} P[x + Y \le z] f_X(x)\,dx = \int_{-\infty}^{\infty} F_Y(z - x) f_X(x)\,dx.$$

Hence

$$f_Z(z) = \frac{dF_Z(z)}{dz} = \frac{d}{dz}\left[\int_{-\infty}^{\infty} F_Y(z - x) f_X(x)\,dx\right] = \int_{-\infty}^{\infty} \frac{dF_Y(z - x)}{dz} f_X(x)\,dx = \int_{-\infty}^{\infty} f_Y(z - x) f_X(x)\,dx.$$

////

Remark The formula given in Eq. (26) is often called the convolution
formula. In mathematical analysis, the function $f_Z(\cdot)$ is called the
convolution of the functions $f_Y(\cdot)$ and $f_X(\cdot)$. ////

EXAMPLE 4 Suppose that X and Y are independent and identically distributed
with density $f_X(x) = f_Y(x) = I_{(0,1)}(x)$. Note that since both X
and Y assume values between 0 and 1, $Z = X + Y$ assumes values between
0 and 2.

$$f_Z(z) = \int_{-\infty}^{\infty} f_Y(z - x) f_X(x)\,dx = \int_{-\infty}^{\infty} I_{(0,1)}(z - x)\, I_{(0,1)}(x)\,dx$$

$$= \int_0^1 \left\{ I_{(0,z)}(x)\, I_{(0,1)}(z) + I_{(z-1,1)}(x)\, I_{[1,2)}(z) \right\} dx$$

$$= I_{(0,1)}(z)\int_0^z dx + I_{[1,2)}(z)\int_{z-1}^{1} dx$$

$$= z\, I_{(0,1)}(z) + (2 - z)\, I_{[1,2)}(z).$$

////
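The convolution integral for the sum of two independent uniform variables can be approximated numerically and compared with the triangular closed form $z$ on (0, 1) and $2 - z$ on [1, 2). The midpoint rule, the integration limits, and the function names below are assumptions chosen for this sketch.

```python
def uniform_density(x):
    """Density of a uniform distribution over the interval (0, 1)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0


def convolution_density(f_x, f_y, z, lower=-1.0, upper=3.0, steps=40_000):
    """Midpoint-rule approximation of the integral of f_Y(z - x) f_X(x) dx."""
    width = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        x = lower + (k + 0.5) * width
        total += f_y(z - x) * f_x(x)
    return total * width


def triangular_density(z):
    """Closed form for the density of the sum of two independent uniforms on (0, 1)."""
    if 0.0 < z < 1.0:
        return z
    if 1.0 <= z < 2.0:
        return 2.0 - z
    return 0.0


for z in (0.25, 0.5, 1.0, 1.5, 1.9):
    print(z, convolution_density(uniform_density, uniform_density, z), triangular_density(z))
```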
FIGURE 1


3.4 Distribution of Product and Quotient 

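Theorem 8 Let X and Y be jointly distributed continuous random variables
with density $f_{X,Y}(x, y)$, and let $Z = XY$ and $U = X/Y$; then

$$f_Z(z) = \int_{-\infty}^{\infty} \frac{1}{|x|}\, f_{X,Y}\!\left(x, \frac{z}{x}\right) dx = \int_{-\infty}^{\infty} \frac{1}{|y|}\, f_{X,Y}\!\left(\frac{z}{y}, y\right) dy \tag{27}$$

and

$$f_U(u) = \int_{-\infty}^{\infty} |y|\, f_{X,Y}(uy, y)\,dy. \tag{28}$$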

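PROOF Again, only the first part of Eq. (27) will be proved. (See
Fig. 1 for $z > 0$.)

$$F_Z(z) = P[Z \le z] = \iint_{xy \le z} f_{X,Y}(x, y)\,dx\,dy$$

$$= \int_{-\infty}^{0}\left[\int_{z/x}^{\infty} f_{X,Y}(x, y)\,dy\right] dx + \int_{0}^{\infty}\left[\int_{-\infty}^{z/x} f_{X,Y}(x, y)\,dy\right] dx,$$

which on making the substitution $u = xy$

$$= \int_{-\infty}^{0}\left[\int_{z}^{-\infty} f_{X,Y}\!\left(x, \frac{u}{x}\right)\frac{du}{x}\right] dx + \int_{0}^{\infty}\left[\int_{-\infty}^{z} f_{X,Y}\!\left(x, \frac{u}{x}\right)\frac{du}{x}\right] dx$$

$$= \int_{-\infty}^{z}\left[\int_{-\infty}^{0} \frac{1}{|x|}\, f_{X,Y}\!\left(x, \frac{u}{x}\right) dx\right] du + \int_{-\infty}^{z}\left[\int_{0}^{\infty} \frac{1}{|x|}\, f_{X,Y}\!\left(x, \frac{u}{x}\right) dx\right] du;$$

hence

$$f_Z(z) = \frac{dF_Z(z)}{dz} = \int_{-\infty}^{\infty} \frac{1}{|x|}\, f_{X,Y}\!\left(x, \frac{z}{x}\right) dx.$$

////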


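EXAMPLE 5 Suppose X and Y are independent random variables, each
uniformly distributed over the interval (0, 1). Let $Z = XY$ and $U = X/Y$.

$$f_Z(z) = \int_{-\infty}^{\infty} \frac{1}{|x|}\, f_{X,Y}\!\left(x, \frac{z}{x}\right) dx = \int_{-\infty}^{\infty} \frac{1}{|x|}\, I_{(0,1)}(x)\, I_{(0,1)}\!\left(\frac{z}{x}\right) dx$$

$$= \int_0^1 \frac{1}{x}\, I_{(0,1)}\!\left(\frac{z}{x}\right) dx = I_{(0,1)}(z)\int_z^1 \frac{1}{x}\,dx = -\log_e z\; I_{(0,1)}(z).$$

$$f_U(u) = \int_{-\infty}^{\infty} |y|\, f_{X,Y}(uy, y)\,dy = \int_{-\infty}^{\infty} |y|\, I_{(0,1)}(uy)\, I_{(0,1)}(y)\,dy \quad \text{(see Fig. 2)}$$

$$= \int_{-\infty}^{\infty} |y| \left\{ I_{(0,1)}(u)\, I_{(0,1)}(y) + I_{[1,\infty)}(u)\, I_{(0,1/u)}(y) \right\} dy$$

$$= I_{(0,1)}(u)\int_0^1 y\,dy + I_{[1,\infty)}(u)\int_0^{1/u} y\,dy = \frac{1}{2}\, I_{(0,1)}(u) + \frac{1}{2}\left(\frac{1}{u}\right)^2 I_{[1,\infty)}(u).$$

Note that $\mathcal{E}[X/Y] = \mathcal{E}[U] = \frac{1}{2}\int_0^1 u\,du + \frac{1}{2}\int_1^{\infty} (1/u)\,du = \infty$, quite different
from $\mathcal{E}[X]/\mathcal{E}[Y] = 1$. ////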
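A simulation can be used to check the densities of the product and quotient of two independent uniform variables on (0, 1). Integrating $-\log z$ over (0, z) gives $z - z\log z$, and integrating the quotient density ($\frac{1}{2}$ on (0, 1), $1/(2u^2)$ on $[1, \infty)$) gives $u/2$ for $u < 1$ and $1 - 1/(2u)$ for $u \ge 1$. The function names, seed, and sample size in this sketch are illustrative assumptions.

```python
import math
import random


def product_cdf(z):
    """Integral of the density -log(t) over (0, z), valid for 0 < z < 1."""
    return z - z * math.log(z)


def quotient_cdf(u):
    """Integral of the quotient density: 1/2 on (0, 1) and 1/(2 u^2) on [1, infinity)."""
    return 0.5 * u if u < 1.0 else 1.0 - 0.5 / u


def empirical_cdfs(z, u, n_draws=100_000, seed=3):
    """Empirical P[XY <= z] and P[X/Y <= u] for independent uniforms on (0, 1)."""
    rng = random.Random(seed)
    below_z = below_u = 0
    for _ in range(n_draws):
        x = rng.random()
        y = 1.0 - rng.random()  # lies in (0, 1], which avoids division by zero
        below_z += x * y <= z
        below_u += x / y <= u
    return below_z / n_draws, below_u / n_draws


print(product_cdf(0.5), quotient_cdf(2.0), empirical_cdfs(0.5, 2.0))
```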
FIGURE 2

4 MOMENT-GENERATING-FUNCTION TECHNIQUE 
4.1 Description of Technique 

There is another method of determining the distribution of functions of random 
variables which we shall find to be particularly useful in certain instances. This 
method is built around the concept of the moment generating function and will 
be called the moment-generating-function technique.

The statement of the problem remains the same. For given random variables
$X_1, \ldots, X_n$ with given density $f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)$ and given functions
$g_1(\cdot, \ldots, \cdot), \ldots, g_k(\cdot, \ldots, \cdot)$, find the joint distribution of $Y_1 = g_1(X_1, \ldots, X_n),
\ldots, Y_k = g_k(X_1, \ldots, X_n)$. Now the joint moment generating function of
$Y_1, \ldots, Y_k$, if it exists, is

$$m_{Y_1, \ldots, Y_k}(t_1, \ldots, t_k) = \mathcal{E}\left[e^{t_1 Y_1 + \cdots + t_k Y_k}\right] = \int \cdots \int e^{t_1 g_1(x_1, \ldots, x_n) + \cdots + t_k g_k(x_1, \ldots, x_n)}\, f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \prod_{i=1}^{n} dx_i. \tag{29}$$

If after the integration of Eq. (29) is performed, the resulting function of
$t_1, \ldots, t_k$ can be recognized as the joint moment generating function of some
known joint distribution, it will follow that $Y_1, \ldots, Y_k$ has that joint distribution
by virtue of the fact that a moment generating function, when it exists, is
unique and uniquely determines its distribution function.

For $k > 1$, this method will be of limited use to us because we can recognize
only a few joint moment generating functions. For $k = 1$, the moment
generating function is a function of a single argument, and we should have a
better chance of recognizing the resulting moment generating function.
This method is quite powerful in connection with certain techniques of
advanced mathematics (the theory of transforms) which, in many instances,
enable one to determine the distribution associated with the derived moment
generating function.

The most useful application of the moment-generating-function technique
will be given in Subsec. 4.2. There it will be used to find the distribution of
sums of independent random variables.


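EXAMPLE 6 Suppose X has a normal distribution with mean 0 and variance 1.
Let $Y = X^2$, and find the distribution of Y.

$$m_Y(t) = \mathcal{E}[e^{tY}] = \mathcal{E}[e^{tX^2}] = \int_{-\infty}^{\infty} e^{tx^2}\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(1 - 2t)x^2}\,dx$$

$$= \frac{1}{\sqrt{1 - 2t}} \cdot \frac{\sqrt{1 - 2t}}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(1 - 2t)x^2}\,dx = (1 - 2t)^{-1/2} = \left(\frac{\frac{1}{2}}{\frac{1}{2} - t}\right)^{1/2} \quad \text{for } t < \tfrac{1}{2},$$

which we recognize as the moment generating function of a gamma with
parameters $r = \frac{1}{2}$ and $\lambda = \frac{1}{2}$. (It is also called a chi-square distribution
with one degree of freedom. See Subsec. 4.3 of Chap. VI.) ////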
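The closed form $(1 - 2t)^{-1/2}$ for the moment generating function of the square of a standard normal variable can be compared with a direct numerical evaluation of the defining integral. The midpoint rule, the truncation of the integration range at $\pm 12$, and the function names are assumptions made for this sketch; the check is meaningful only for $t < \frac{1}{2}$ and, with this truncation, for t not too close to $\frac{1}{2}$.

```python
import math


def mgf_of_square_numeric(t, half_width=12.0, steps=20_000):
    """Midpoint-rule value of the integral of exp(t x^2) phi(x) dx, for t < 1/2."""
    width = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = -half_width + (k + 0.5) * width
        # exp(t x^2) * exp(-x^2 / 2) combined into a single exponent
        total += math.exp(-0.5 * (1.0 - 2.0 * t) * x * x)
    return total * width / math.sqrt(2.0 * math.pi)


def mgf_of_square_closed_form(t):
    """Moment generating function of a gamma distribution with r = 1/2 and lambda = 1/2."""
    return (1.0 - 2.0 * t) ** -0.5


for t in (-1.0, 0.0, 0.25):
    print(t, mgf_of_square_numeric(t), mgf_of_square_closed_form(t))
```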


EXAMPLE 7 Let $X_1$ and $X_2$ be two independent standard normal random
variables. Let $Y_1 = g_1(X_1, X_2) = X_1 + X_2$ and $Y_2 = g_2(X_1, X_2) =
X_2 - X_1$. Find the joint distribution of $Y_1$ and $Y_2$.

$$m_{Y_1, Y_2}(t_1, t_2) = \mathcal{E}\left[e^{Y_1 t_1 + Y_2 t_2}\right] = \mathcal{E}\left[e^{(X_1 + X_2)t_1 + (X_2 - X_1)t_2}\right] = \mathcal{E}\left[e^{X_1(t_1 - t_2) + X_2(t_1 + t_2)}\right]$$

$$= \mathcal{E}\left[e^{X_1(t_1 - t_2)}\right] \mathcal{E}\left[e^{X_2(t_1 + t_2)}\right] = m_{X_1}(t_1 - t_2)\, m_{X_2}(t_1 + t_2)$$

$$= \exp\frac{(t_1 - t_2)^2}{2}\, \exp\frac{(t_1 + t_2)^2}{2} = \exp(t_1^2 + t_2^2) = \exp\frac{2t_1^2}{2}\, \exp\frac{2t_2^2}{2} = m_{Y_1}(t_1)\, m_{Y_2}(t_2).$$

We note that $Y_1$ and $Y_2$ are independent random variables (by Theorem 10
of Chap. IV) and each has a normal distribution with mean 0 and variance
2. ////

In the above example we were able to manipulate expectations and avoid 
performing an integration to find the desired joint moment generating function. 
In the following example the integration will have to be performed. 


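EXAMPLE 8 Let $X_1$ and $X_2$ be two independent standard normal random
variables. Let $Y = (X_2 - X_1)^2/2$, and find the distribution of Y.

$$m_Y(t) = \mathcal{E}[\exp tY] = \mathcal{E}\left[\exp\frac{(X_2 - X_1)^2}{2}\, t\right] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{1}{2\pi} \exp\left[\frac{(x_2 - x_1)^2}{2}\, t - \frac{x_1^2 + x_2^2}{2}\right] dx_1\,dx_2$$

$$= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{1}{2\pi} \exp\left\{-\frac{1}{2}\left[x_1^2(1 - t) + 2x_1 x_2 t + x_2^2(1 - t)\right]\right\} dx_1\,dx_2$$

$$= \int_{-\infty}^{\infty} \frac{1}{2\pi} \exp\left[-\frac{(1 - 2t)\,x_2^2}{2(1 - t)}\right] \left\{\int_{-\infty}^{\infty} \exp\left[-\frac{1 - t}{2}\left(x_1 + \frac{t x_2}{1 - t}\right)^2\right] dx_1\right\} dx_2$$

$$= \frac{1}{\sqrt{1 - t}} \cdot \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left[-\frac{(1 - 2t)\,x_2^2}{2(1 - t)}\right] dx_2 = \frac{1}{\sqrt{1 - t}} \cdot \frac{\sqrt{1 - t}}{\sqrt{1 - 2t}}$$

$$= (1 - 2t)^{-1/2} \quad \text{for } t < \tfrac{1}{2},$$

which is the moment generating function of a gamma distribution with
parameters $r = \frac{1}{2}$ and $\lambda = \frac{1}{2}$; hence,

$$f_Y(y) = \frac{1}{\sqrt{2\pi}}\, y^{-1/2}\, e^{-y/2}\, I_{(0,\infty)}(y).$$

////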
4.2 Distribution of Sums of Independent Random Variables

In this subsection we employ the moment-generating-function technique to find
the distribution of the sum of independent random variables.

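Theorem 9 If $X_1, \ldots, X_n$ are independent random variables and the
moment generating function of each exists for all $-h < t < h$ for some
$h > 0$, let $Y = \sum_{i=1}^{n} X_i$; then

$$m_Y(t) = \mathcal{E}\left[\exp \sum X_i t\right] = \prod_{i=1}^{n} m_{X_i}(t) \quad \text{for } -h < t < h.$$

PROOF

$$m_Y(t) = \mathcal{E}\left[\exp \sum X_i t\right] = \mathcal{E}\left[\prod_{i=1}^{n} e^{X_i t}\right] = \prod_{i=1}^{n} \mathcal{E}\left[e^{X_i t}\right] = \prod_{i=1}^{n} m_{X_i}(t)$$

using Theorem 10 of Chap. IV. ////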

The power and utility of Theorem 9 become apparent if we recall Theorem
7 of Chap. II, which says that a moment generating function, when it exists,
determines the distribution function. Thus, if we can recognize $\prod_{i=1}^{n} m_{X_i}(t)$ as
the moment generating function corresponding to a particular distribution,
then we have found the distribution of $\sum X_i$. In the following examples, we
will be able to do just that.


EXAMPLE 9 Suppose that $X_1, \ldots, X_n$ are independent Bernoulli random
variables; that is, $P[X_i = 1] = p$, and $P[X_i = 0] = 1 - p$. Now

$$m_{X_i}(t) = pe^t + q, \quad \text{where } q = 1 - p.$$

So

$$m_{\sum X_i}(t) = \prod_{i=1}^{n} m_{X_i}(t) = (pe^t + q)^n,$$

the moment generating function of a binomial random variable; hence
$\sum X_i$ has a binomial distribution with parameters n and p. ////
EXAMPLE 10 Suppose that $X_1, \ldots, X_n$ are independent Poisson distributed
random variables, $X_i$ having parameter $\lambda_i$. Then

$$m_{X_i}(t) = \mathcal{E}\left[e^{tX_i}\right] = \exp \lambda_i(e^t - 1),$$

and hence

$$m_{\sum X_i}(t) = \prod_{i=1}^{n} m_{X_i}(t) = \prod_{i=1}^{n} \exp \lambda_i(e^t - 1) = \exp \sum \lambda_i(e^t - 1),$$

which is again the moment generating function of a Poisson distributed
random variable having parameter $\sum \lambda_i$. So the distribution of a sum of
independent Poisson distributed random variables is again a Poisson
distributed random variable with a parameter equal to the sum of the
individual parameters. ////
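The Poisson result can also be verified without moment generating functions, by convolving the two mass functions directly. The sketch below does this for two variables; the function names and the parameter values 1.5 and 2.5 are arbitrary choices for illustration.

```python
import math


def poisson_pmf(k, rate):
    """P[X = k] for a Poisson random variable with the given parameter."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def pmf_of_sum(k, rate_1, rate_2):
    """P[X_1 + X_2 = k] by direct convolution of the two Poisson mass functions."""
    return sum(poisson_pmf(j, rate_1) * poisson_pmf(k - j, rate_2) for j in range(k + 1))


# The convolution should agree with a Poisson mass function with parameter 1.5 + 2.5 = 4.
for k in range(6):
    print(k, pmf_of_sum(k, 1.5, 2.5), poisson_pmf(k, 4.0))
```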

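EXAMPLE 11 Assume that $X_1, \ldots, X_n$ are independent and identically distributed
exponential random variables with parameter $\lambda$; then

$$m_{X_i}(t) = \frac{\lambda}{\lambda - t}.$$

So

$$m_{\sum X_i}(t) = \prod_{i=1}^{n} m_{X_i}(t) = \left(\frac{\lambda}{\lambda - t}\right)^n,$$

which is the moment generating function of a gamma distribution with
parameters n and $\lambda$; hence,

$$f_{\sum X_i}(x) = \frac{\lambda^n}{\Gamma(n)}\, x^{n-1} e^{-\lambda x}\, I_{(0,\infty)}(x),$$

the density of a gamma distribution with parameters n and $\lambda$. ////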

EXAMPLE 12 Assume that $X_1, \ldots, X_n$ are independent random variables
and

$$X_i \sim N(\mu_i, \sigma_i^2);$$

then

$$a_i X_i \sim N(a_i \mu_i, a_i^2 \sigma_i^2),$$

and

$$m_{a_i X_i}(t) = \exp\left(a_i \mu_i t + \tfrac{1}{2} a_i^2 \sigma_i^2 t^2\right).$$

Hence

$$m_{\sum a_i X_i}(t) = \prod_{i=1}^{n} m_{a_i X_i}(t) = \exp\left[\left(\sum a_i \mu_i\right) t + \tfrac{1}{2}\left(\sum a_i^2 \sigma_i^2\right) t^2\right],$$

which is the moment generating function of a normal random variable; so

$$\sum_{i=1}^{n} a_i X_i \sim N\left(\sum_{i=1}^{n} a_i \mu_i,\; \sum_{i=1}^{n} a_i^2 \sigma_i^2\right).$$

The above says that any linear combination (that is, $\sum a_i X_i$) of independent
normal random variables is itself a normally distributed random
variable. (Actually, any linear combination of jointly normally distributed
random variables is normally distributed. Independence is not
required.) In particular, if

$$X \sim N(\mu_X, \sigma_X^2), \qquad Y \sim N(\mu_Y, \sigma_Y^2),$$

and X and Y are independent, then

$$X + Y \sim N(\mu_X + \mu_Y,\; \sigma_X^2 + \sigma_Y^2)$$

and

$$X - Y \sim N(\mu_X - \mu_Y,\; \sigma_X^2 + \sigma_Y^2).$$

If $X_1, \ldots, X_n$ are independent and identically distributed random variables
distributed $N(\mu, \sigma^2)$, then

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \sim N\left(\mu, \frac{\sigma^2}{n}\right);$$

that is, the sample mean has an exact (not approximate) normal distribution. ////

In the above examples we found the exact distribution of the sums of
certain independent random variables. Other examples, including the important
result that the sum of independent identically distributed geometric random
variables has a negative binomial distribution, are given in the Problems. One
is often more interested in the average, that is, $(1/n)\sum X_i$, than in the sum.
Note, however, that if the distribution of the sum is known, then the distribution
of the average is readily derivable since

$$F_{(1/n)\sum X_i}(x) = P\left[\frac{1}{n}\sum X_i \le x\right] = P\left[\sum X_i \le nx\right] = F_{\sum X_i}(nx). \tag{30}$$

In Examples 9 to 12 above, where we derived the distribution of a sum, we have
in essence also derived the distribution of the corresponding average. One of
the most important theorems of all probability theory, the central-limit theorem,
gives an approximate distribution of an average. We will state this theorem
next and then again in our discussion of sampling in Chap. VI, where we will
outline its proof.


Theorem 10 Central-limit theorem If for each positive integer n,
$X_1, \ldots, X_n$ are independent and identically distributed random variables
with mean $\mu_X$ and variance $\sigma_X^2$, then for each z

$$F_{Z_n}(z) \text{ converges to } \Phi(z) \text{ as } n \text{ approaches } \infty, \tag{31}$$

where

$$Z_n = \frac{\bar{X}_n - \mathcal{E}[\bar{X}_n]}{\sqrt{\operatorname{var}[\bar{X}_n]}} = \frac{\bar{X}_n - \mu_X}{\sigma_X/\sqrt{n}}.$$

////

We have made use of Eq. (9), which stated that $\mathcal{E}[\bar{X}_n] = \mu_X$ and $\operatorname{var}[\bar{X}_n] =
\sigma_X^2/n$. Equation (31) states that for each fixed argument z the value of the
cumulative distribution function of $Z_n$, for $n = 1, 2, \ldots$, converges to the value
$\Phi(z)$. [Recall that $\Phi(\cdot)$ is the cumulative distribution function of the standard
normal distribution.]

Note what the central-limit theorem says: If you have independent random
variables $X_1, \ldots, X_n, \ldots$, each with the same distribution which has a mean and
variance, then $\bar{X}_n = (1/n)\sum X_i$ "standardized" by subtracting its mean and
then dividing by its standard deviation has a distribution that approaches a
standard normal distribution. The key thing to note is that it does not make
any difference what common distribution the $X_1, \ldots, X_n, \ldots$ have, as long as
they have a mean and variance. A number of useful approximations can be
garnered from the central-limit theorem, and they are listed as a corollary.


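Corollary If $X_1, \ldots, X_n$ are independent and identically distributed
random variables with common mean $\mu_X$ and variance $\sigma_X^2$, then

$$P\left[a < \frac{\bar{X}_n - \mu_X}{\sigma_X/\sqrt{n}} < b\right] \approx \Phi(b) - \Phi(a), \tag{32}$$

$$P[c < \bar{X}_n < d] \approx \Phi\left(\frac{d - \mu_X}{\sigma_X/\sqrt{n}}\right) - \Phi\left(\frac{c - \mu_X}{\sigma_X/\sqrt{n}}\right), \tag{33}$$

or

$$P\left[r < \sum_{i=1}^{n} X_i < s\right] \approx \Phi\left(\frac{s - n\mu_X}{\sqrt{n}\,\sigma_X}\right) - \Phi\left(\frac{r - n\mu_X}{\sqrt{n}\,\sigma_X}\right). \tag{34}$$

////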
Equations (32) to (34) give approximate values for the probabilities of
certain events described in terms of averages or sums. The practical utility
of the central-limit theorem is inherent in these approximations.
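To illustrate the kind of approximation meant here, the sketch below applies the normal approximation for a sum, $\Phi((s - n\mu_X)/(\sqrt{n}\,\sigma_X)) - \Phi((r - n\mu_X)/(\sqrt{n}\,\sigma_X))$, to a sum of Bernoulli variables and compares it with the exact binomial probability. The function names and the choice $n = 100$, $p = \frac{1}{2}$ are assumptions for illustration; no continuity correction is applied.

```python
import math


def std_normal_cdf(x):
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def clt_sum_probability(r, s, n, mean, std_dev):
    """Normal approximation to P[r < X_1 + ... + X_n < s]."""
    scale = math.sqrt(n) * std_dev
    return std_normal_cdf((s - n * mean) / scale) - std_normal_cdf((r - n * mean) / scale)


def exact_bernoulli_sum_probability(r, s, n, p):
    """Exact P[r < S < s] for S binomial(n, p), with integer r and s."""
    return sum(math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(r + 1, s))


# A Bernoulli variable with p = 1/2 has mean 1/2 and standard deviation 1/2.
approx = clt_sum_probability(40, 60, 100, 0.5, 0.5)
exact = exact_bernoulli_sum_probability(40, 60, 100, 0.5)
print("normal approximation:", approx, "exact binomial:", exact)
```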

At this stage we can conveniently discuss and contrast two terms that are a
vital part of a statistician's vocabulary. These two terms are limiting distribution
and asymptotic distribution. A distribution is called a limiting distribution
function if it is the limit distribution function of a sequence of distribution
functions. Equation (31) provides us with an example: $\Phi(z)$ is the limiting
distribution function of the sequence of distribution functions $F_{Z_1}(\cdot), F_{Z_2}(\cdot), \ldots,
F_{Z_n}(\cdot), \ldots$. Also, $\Phi(z)$ is called the limiting distribution of the sequence of random
variables $Z_1, Z_2, \ldots, Z_n, \ldots$. On the other hand, an asymptotic distribution
of a random variable, say $Y_n$, in a sequence of random variables $Y_1, Y_2, \ldots, Y_n, \ldots$
is any distribution that is approximately equal to the actual distribution of $Y_n$
for large n. As an example [see Eq. (33)], we say that $\bar{X}_n$ has an asymptotic
distribution that is a normal distribution with mean $\mu_X$ and variance $\sigma_X^2/n$. Note
that an asymptotic distribution may depend on n whereas a limiting distribution
does not (for a limiting distribution the dependence on n was removed in taking
the limit). Yet the two terms are closely related since it was precisely the fact
that the sequence $Z_1, Z_2, \ldots, Z_n, \ldots$ had limiting standard normal distribution
that allowed us to say that $\bar{X}_n$ had an asymptotic normal distribution with mean
$\mu_X$ and variance $\sigma_X^2/n$. The idea is that if the distribution of $Z_n$ is converging
to $\Phi(z)$, then for large n the distribution of $Z_n$ must be approximately distributed
$N(0, 1)$. But if $Z_n = (\bar{X}_n - \mu_X)/(\sigma_X/\sqrt{n})$ is approximately distributed $N(0, 1)$,
then $\bar{X}_n$ is approximately distributed $N(\mu_X, \sigma_X^2/n)$.

In concluding this section we give two further examples concerning sums.
The first shows how expressing one random variable as a sum of other simpler
random variables is often a useful ploy. The second shows how the distribution
of a sum can be obtained even though the number of terms in the sum is also a
random variable, something that occasionally occurs in practice.


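EXAMPLE 13 Consider n repeated independent trials, each of which has
possible outcomes $s_1, \ldots, s_{k+1}$. Let $p_j$ denote the probability of outcome
$s_j$ on a particular trial, and let $X_j$ denote the number of the n trials resulting
in outcome $s_j$, $j = 1, \ldots, k + 1$. We saw that $(X_1, \ldots, X_k)$ had a multinomial
distribution. Now let

$$Z_{j\alpha} = \begin{cases} 1 & \text{if the } \alpha\text{th trial results in outcome } s_j \\ 0 & \text{otherwise;} \end{cases}$$

then $X_j = \sum_{\alpha=1}^{n} Z_{j\alpha}$. Now suppose we want to find $\operatorname{cov}[X_i, X_j]$ for $i \ne j$. Intuitively,
we might suspect that such covariance is negative since when one
of the random variables is large another tends to be small.

$$\operatorname{cov}[X_i, X_j] = \operatorname{cov}\left[\sum_{\alpha=1}^{n} Z_{i\alpha}, \sum_{\beta=1}^{n} Z_{j\beta}\right] = \sum_{\alpha=1}^{n}\sum_{\beta=1}^{n} \operatorname{cov}[Z_{i\alpha}, Z_{j\beta}]$$

by Theorem 2. Now if $\alpha \ne \beta$, then $Z_{i\alpha}$ and $Z_{j\beta}$ are independent since they
correspond to different trials, which are independent. Hence

$$\sum_{\alpha=1}^{n}\sum_{\beta=1}^{n} \operatorname{cov}[Z_{i\alpha}, Z_{j\beta}] = \sum_{\alpha=1}^{n} \operatorname{cov}[Z_{i\alpha}, Z_{j\alpha}].$$

But $\operatorname{cov}[Z_{i\alpha}, Z_{j\alpha}] = \mathcal{E}[Z_{i\alpha} Z_{j\alpha}] - \mathcal{E}[Z_{i\alpha}]\,\mathcal{E}[Z_{j\alpha}]$, and $\mathcal{E}[Z_{i\alpha} Z_{j\alpha}] = 0$ since
at least one of $Z_{i\alpha}$ and $Z_{j\alpha}$ must be 0. Now $\mathcal{E}[Z_{i\alpha}] = p_i$, and $\mathcal{E}[Z_{j\alpha}] = p_j$;
so $\operatorname{cov}[X_i, X_j] = -n p_i p_j$. ////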
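The covariance $-n p_i p_j$ between two multinomial counts can be checked by simulation. In the sketch below the outcome probabilities (0.2, 0.3, 0.5), the number of trials, the seed, and the function names are all assumptions chosen for illustration.

```python
import random


def multinomial_covariance(n_trials, probs, i, j):
    """cov[X_i, X_j] = -n p_i p_j for i != j, as derived in Example 13."""
    return -n_trials * probs[i] * probs[j]


def simulated_covariance(n_trials, probs, i, j, n_runs=20_000, seed=4):
    """Sample covariance of the counts of outcomes i and j over repeated experiments."""
    rng = random.Random(seed)
    outcomes = list(range(len(probs)))
    sum_i = sum_j = sum_ij = 0
    for _ in range(n_runs):
        draws = rng.choices(outcomes, weights=probs, k=n_trials)
        x_i, x_j = draws.count(i), draws.count(j)
        sum_i += x_i
        sum_j += x_j
        sum_ij += x_i * x_j
    return sum_ij / n_runs - (sum_i / n_runs) * (sum_j / n_runs)


probs = [0.2, 0.3, 0.5]
print(multinomial_covariance(20, probs, 0, 1), simulated_covariance(20, probs, 0, 1))
```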


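EXAMPLE 14 Let $X_1, \ldots, X_n, \ldots$ be a sequence of independent and identically
distributed random variables with mean $\mu_X$ and variance $\sigma_X^2$. Let N
be an integer-valued random variable, and define $S_N = \sum_{i=1}^{N} X_i$; that is, $S_N$
is the sum of the first N $X_i$'s, where N is a random variable as are the
$X_i$'s. Thus $S_N$ is a sum of a random number of random variables. Let us
assume that N is independent of the $X_i$'s. Then $\mathcal{E}[S_N] = \mathcal{E}[\mathcal{E}[S_N \mid N]]$
by Eq. (26) of Chap. IV. But $\mathcal{E}[S_N \mid N = n] = \mathcal{E}[X_1 + \cdots + X_n] = n\mu_X$;
so $\mathcal{E}[S_N \mid N] = N\mu_X$, and $\mathcal{E}[\mathcal{E}[S_N \mid N]] = \mathcal{E}[N\mu_X] = \mu_X \mathcal{E}[N] = \mu_X \mu_N$. Similarly,
using Theorem 7 of Chap. IV,

$$\operatorname{var}[S_N] = \mathcal{E}[\operatorname{var}[S_N \mid N]] + \operatorname{var}[\mathcal{E}[S_N \mid N]]$$

$$= \mathcal{E}[N\sigma_X^2] + \operatorname{var}[N\mu_X]$$

$$= \sigma_X^2\, \mathcal{E}[N] + \mu_X^2 \operatorname{var}[N].$$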

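Suppose now that N has a geometric distribution [see Eq. (14) of Chap. III]
with parameter p, $X_i$ has an exponential distribution with parameter $\lambda$,
and we are interested in the distribution of $S_N$. Further assume independence
of N and the $X_i$'s. Now, for $z > 0$,

$$P[S_N \le z] = \sum_{n=1}^{\infty} P[S_N \le z \mid N = n]\, P[N = n] = \sum_{n=1}^{\infty} P\left[\sum_{i=1}^{n} X_i \le z\right] p(1 - p)^{n-1}$$

$$= \sum_{n=1}^{\infty} \int_0^z \frac{\lambda^n}{\Gamma(n)}\, u^{n-1} e^{-\lambda u}\,du \; p(1 - p)^{n-1}$$

(by using the fact that a sum of independent and identically distributed
exponential random variables has a gamma distribution)

$$= \int_0^z \lambda p\, e^{-\lambda u} \sum_{n=1}^{\infty} \frac{[\lambda u(1 - p)]^{n-1}}{(n - 1)!}\,du = \lambda p \int_0^z e^{-\lambda u} e^{\lambda u(1 - p)}\,du = \lambda p \int_0^z e^{-\lambda p u}\,du = 1 - e^{-p\lambda z}.$$

That is, $S_N$ has an exponential distribution with parameter $p\lambda$. Recall
that [see Eq. (14) of Chap. III] $\mathcal{E}[N] = 1/p$ and $\operatorname{var}[N] = (1 - p)/p^2$; also,
$\mathcal{E}[X] = 1/\lambda$ and $\operatorname{var}[X] = 1/\lambda^2$. So, as a check of the formulas for the
mean and variance derived above, note that

$$\mathcal{E}[S_N] = \mu_N \mu_X = \frac{1}{p} \cdot \frac{1}{\lambda} = \frac{1}{p\lambda};$$

and

$$\operatorname{var}[S_N] = \mu_N \sigma_X^2 + \mu_X^2 \sigma_N^2 = \frac{1}{p} \cdot \frac{1}{\lambda^2} + \frac{1}{\lambda^2} \cdot \frac{1 - p}{p^2} = \frac{1}{p^2\lambda^2},$$

which are the mean and variance, respectively, of an exponential distribution
with parameter $p\lambda$. ////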
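The random-sum result lends itself to a simulation check. The sketch below draws a geometric number N of exponential terms (N taking values 1, 2, ..., so that its mean is $1/p$) and compares the sample mean and variance of the sum with the formulas $\mu_X \mu_N$ and $\sigma_X^2 \mu_N + \mu_X^2 \sigma_N^2$. The names, seed, and parameter values $p = 0.25$, $\lambda = 2$ are illustrative assumptions.

```python
import random


def draw_random_sum(rng, p, rate):
    """One draw of S_N: exponential(rate) terms, with N geometric on 1, 2, ..."""
    total = rng.expovariate(rate)   # N >= 1, so there is always a first term
    while rng.random() >= p:        # continue with probability 1 - p
        total += rng.expovariate(rate)
    return total


def simulate_random_sum(p, rate, n_runs=50_000, seed=5):
    """Sample mean and variance of S_N over repeated draws."""
    rng = random.Random(seed)
    draws = [draw_random_sum(rng, p, rate) for _ in range(n_runs)]
    mean = sum(draws) / n_runs
    var = sum((d - mean) ** 2 for d in draws) / (n_runs - 1)
    return mean, var


def random_sum_moments(mean_n, var_n, mean_x, var_x):
    """E[S_N] = mu_X mu_N and var[S_N] = sigma_X^2 mu_N + mu_X^2 sigma_N^2."""
    return mean_x * mean_n, var_x * mean_n + mean_x**2 * var_n


# p = 0.25 and rate = 2: E[N] = 4, var[N] = 12, E[X] = 0.5, var[X] = 0.25.
print(random_sum_moments(4.0, 12.0, 0.5, 0.25), simulate_random_sum(0.25, 2.0))
```

With these values both formulas agree with an exponential distribution with parameter $p\lambda = 0.5$, whose mean is 2 and whose variance is 4.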


5 THE TRANSFORMATION Y = g(X)

The last of our three techniques for finding the distribution of functions of
given random variables is the transformation technique. It is discussed in this
section for the special case of finding the distribution of a function of a unidimensional
random variable. That is, for a given random variable X we seek
the distribution of $Y = g(X)$ for some function $g(\cdot)$. Discussion of the general
case is deferred until Sec. 6 below. Both the notation $Y = g(X)$ and the notation
$y = g(x)$ will appear in the ensuing paragraphs; $y = g(x)$ is the usual notation
for the function or transformation specified by $g(\cdot)$, and $Y = g(X)$ defines the
random variable Y as the function $g(\cdot)$ of the random variable X.

5.1 Distribution of Y = g(X)

A random variable X may be transformed by some function $g(\cdot)$ to define a
new random variable Y. The density of Y, $f_Y(y)$, will be determined by the
transformation $g(\cdot)$ together with the density $f_X(x)$ of X.

First, if X is a discrete random variable with mass points $x_1, x_2, \ldots$,
then the distribution of $Y = g(X)$ is determined directly by the laws of probability.
If X takes on the values $x_1, x_2, \ldots$ with probabilities $f_X(x_1), f_X(x_2), \ldots$,
then the possible values of Y are determined by substituting the successive
values of X in $g(\cdot)$. It may be that several values of X give rise to the same
value of Y. The probability that Y takes on a given value, say $y_j$, is

$$f_Y(y_j) = \sum_{\{i:\, g(x_i) = y_j\}} f_X(x_i). \tag{35}$$


EXAMPLE 15 Suppose X takes on the values 0, 1, 2, 3, 4, 5 with probabilities
$f_X(0), f_X(1), f_X(2), f_X(3), f_X(4)$, and $f_X(5)$. If $Y = g(X) = (X - 2)^2$,
note that Y can take on values 0, 1, 4, and 9; then $f_Y(0) = f_X(2)$,
$f_Y(1) = f_X(1) + f_X(3)$, $f_Y(4) = f_X(0) + f_X(4)$, and $f_Y(9) = f_X(5)$. ////
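The bookkeeping in the discrete case, adding up the probabilities of all mass points that map to the same value, can be sketched in a few lines. The helper name and the choice of equal probabilities 1/6 for the six values of X are assumptions for illustration; the text leaves the probabilities general.

```python
from collections import defaultdict
from fractions import Fraction


def pmf_of_function(pmf_x, g):
    """f_Y(y) is the sum of f_X(x) over all mass points x with g(x) = y."""
    pmf_y = defaultdict(Fraction)
    for x, prob in pmf_x.items():
        pmf_y[g(x)] += prob
    return dict(pmf_y)


# Hypothetical probabilities: X uniform on {0, 1, 2, 3, 4, 5}; Y = (X - 2)^2.
pmf_x = {x: Fraction(1, 6) for x in range(6)}
pmf_y = pmf_of_function(pmf_x, lambda x: (x - 2) ** 2)
for y in sorted(pmf_y):
    print(y, pmf_y[y])
```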

Second, if X is a continuous random variable, then the cumulative distribution
function of $Y = g(X)$ can be found by integrating $f_X(x)$ over the appropriate
region; that is,

$$F_Y(y) = P[Y \le y] = P[g(X) \le y] = \int_{\{x:\, g(x) \le y\}} f_X(x)\,dx. \tag{36}$$

This is just the cumulative-distribution-function technique.


EXAMPLE 16 Let X be a random variable with uniform distribution over the
interval (0, 1) and let $Y = g(X) = X^2$. The density of Y is desired.
Now

$$F_Y(y) = P[Y \le y] = P[X^2 \le y] = \int_{\{x:\, x^2 \le y\}} f_X(x)\,dx = \int_0^{\sqrt{y}} dx = \sqrt{y}$$

for $0 < y < 1$; so

$$F_Y(y) = \sqrt{y}\, I_{(0,1)}(y) + I_{[1,\infty)}(y),$$

and therefore

$$f_Y(y) = \frac{1}{2\sqrt{y}}\, I_{(0,1)}(y).$$

////

Application of the cumulative-distribution-function technique to find the 
density of Y = g(X), as in the above example, produces the transformation 
technique, the result of which is given in the following theorem. 
Theorem 11 Suppose X is a continuous random variable with probability
density function $f_X(\cdot)$. Set $\mathfrak{X} = \{x: f_X(x) > 0\}$. Assume that:
(i) $y = g(x)$ defines a one-to-one transformation of $\mathfrak{X}$ onto $\mathfrak{Y}$.
(ii) The derivative of $x = g^{-1}(y)$ with respect to y is continuous and
nonzero for $y \in \mathfrak{Y}$, where $g^{-1}(y)$ is the inverse function of $g(x)$; that is,
$g^{-1}(y)$ is that x for which $g(x) = y$.

Then $Y = g(X)$ is a continuous random variable with density

$$f_Y(y) = \left|\frac{d}{dy} g^{-1}(y)\right| f_X(g^{-1}(y))\, I_{\mathfrak{Y}}(y).$$

PROOF The above is a standard theorem from calculus on the change
of variable in a definite integral, so we will only sketch the proof. Consider
the case when $\mathfrak{X}$ is an interval. Let us suppose that $g(x)$ is a monotone
increasing function over $\mathfrak{X}$; that is, $g'(x) > 0$, which is true if and only if
$(d/dy)g^{-1}(y) > 0$ over $\mathfrak{Y}$. For $y \in \mathfrak{Y}$, $F_Y(y) = P[g(X) \le y] = P[X \le
g^{-1}(y)] = F_X(g^{-1}(y))$, and hence $f_Y(y) = (d/dy)F_Y(y) = [(d/dy)g^{-1}(y)]\,
f_X(g^{-1}(y))$ by the chain rule of differentiation. On the other hand, if $g(x)$
is a monotone decreasing function over $\mathfrak{X}$, so that $g'(x) < 0$ and
$(d/dy)g^{-1}(y) < 0$, then $F_Y(y) = P[g(X) \le y] = P[X \ge g^{-1}(y)] = 1 - F_X(g^{-1}(y))$,
and therefore $f_Y(y) = -[(d/dy)g^{-1}(y)]\, f_X(g^{-1}(y)) = |(d/dy)g^{-1}(y)|\,
f_X(g^{-1}(y))$ for $y \in \mathfrak{Y}$. ////


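EXAMPLE 17 Suppose X has a beta distribution. What is the distribution of
$Y = -\log_e X$? $\mathfrak{X} = \{x: f_X(x) > 0\} = \{x: 0 < x < 1\}$. $y = g(x) = -\log_e x$
defines a one-to-one transformation of $\mathfrak{X}$ onto $\mathfrak{Y} = \{y: y > 0\}$.
$x = g^{-1}(y) = e^{-y}$, so $(d/dy)g^{-1}(y) = -e^{-y}$, which is continuous and nonzero
for $y \in \mathfrak{Y}$. By Theorem 11,

$$f_Y(y) = \left|\frac{d}{dy} g^{-1}(y)\right| f_X(g^{-1}(y))\, I_{\mathfrak{Y}}(y) = e^{-y}\, \frac{1}{B(a, b)}\, (e^{-y})^{a-1} (1 - e^{-y})^{b-1}\, I_{(0,\infty)}(y)$$

$$= \frac{1}{B(a, b)}\, e^{-ay} (1 - e^{-y})^{b-1}\, I_{(0,\infty)}(y).$$

In particular, if $b = 1$, then $B(a, b) = 1/a$; so $f_Y(y) = a e^{-ay} I_{(0,\infty)}(y)$, an
exponential distribution with parameter a. ////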
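The change-of-variable formula for a one-to-one transformation can be written as a small generic helper and checked against the special case $b = 1$ of the beta example, where the answer is the exponential density $a e^{-ay}$. The helper names and the value $a = 3$ are assumptions for illustration; the caller is responsible for supplying the inverse function and its derivative.

```python
import math


def transformed_density(f_x, g_inverse, g_inverse_derivative, y):
    """f_Y(y) = |d/dy g^{-1}(y)| * f_X(g^{-1}(y)) for y in the image set."""
    return abs(g_inverse_derivative(y)) * f_x(g_inverse(y))


def beta_a_1_density(a):
    """Beta density with parameters a and b = 1: a x^(a-1) on (0, 1)."""
    return lambda x: a * x ** (a - 1.0) if 0.0 < x < 1.0 else 0.0


# Y = -log X, so x = g^{-1}(y) = exp(-y) and d/dy g^{-1}(y) = -exp(-y).
a = 3.0
for y in (0.1, 1.0, 2.5):
    value = transformed_density(
        beta_a_1_density(a), lambda v: math.exp(-v), lambda v: -math.exp(-v), y
    )
    print(y, value, a * math.exp(-a * y))
```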
EXAMPLE 18 Suppose X has the Pareto density $f_X(x) = \theta x^{-\theta-1} I_{(1,\infty)}(x)$ and
the distribution of $Y = \log_e X$ is desired.

$$f_Y(y) = \left|\frac{d}{dy} g^{-1}(y)\right| f_X(g^{-1}(y))\, I_{\mathfrak{Y}}(y) = e^{y}\, \theta (e^{y})^{-\theta-1}\, I_{(1,\infty)}(e^{y}) = \theta e^{-\theta y}\, I_{(0,\infty)}(y).$$

////

The condition that $g(x)$ be a one-to-one transformation of $\mathfrak{X}$ onto $\mathfrak{Y}$ is
unnecessarily restrictive. For the transformation $y = g(x)$, each point in $\mathfrak{X}$
will correspond to just one point in $\mathfrak{Y}$; but to a point in $\mathfrak{Y}$ there may correspond
more than one point in $\mathfrak{X}$, which says that the transformation is not one-to-one,
and consequently Theorem 11 is not directly applicable. If, however, $\mathfrak{X}$ can be
decomposed into a finite (or even countable) number of disjoint sets, say $\mathfrak{X}_1, \ldots,
\mathfrak{X}_m$, so that $y = g(x)$ defines a one-to-one transformation of each $\mathfrak{X}_i$ into $\mathfrak{Y}$,
then the density of $Y = g(X)$ can be found. Let $x = g_i^{-1}(y)$ denote the
inverse of $y = g(x)$ for $x \in \mathfrak{X}_i$. Then the density of $Y = g(X)$ is given by

$$f_Y(y) = \sum \left|\frac{d}{dy} g_i^{-1}(y)\right| f_X(g_i^{-1}(y))\, I_{\mathfrak{Y}}(y), \tag{37}$$

where the summation is over those values of i for which $g(x) = y$ for some value
of x in $\mathfrak{X}_i$.


EXAMPLE 19 Let X be a continuous random variable with density $f_X(\cdot)$,
and let $Y = g(X) = X^2$. Note that if $\mathfrak{X}$ is an interval containing both
negative and positive points, then $y = g(x) = x^2$ is not one-to-one. However,
if $\mathfrak{X}$ is decomposed into $\mathfrak{X}_1 = \{x: x \in \mathfrak{X}, x < 0\}$ and $\mathfrak{X}_2 = \{x: x \in \mathfrak{X},
x \ge 0\}$, then $y = g(x)$ defines a one-to-one transformation on each $\mathfrak{X}_i$.
Note that $g_1^{-1}(y) = -\sqrt{y}$ and $g_2^{-1}(y) = \sqrt{y}$. By Eq. (37),

$$f_Y(y) = \left[\frac{1}{2\sqrt{y}}\, f_X(-\sqrt{y}) + \frac{1}{2\sqrt{y}}\, f_X(\sqrt{y})\right] I_{(0,\infty)}(y).$$

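In particular, if

$$f_X(x) = \tfrac{1}{2}\, e^{-|x|},$$

then

$$f_Y(y) = \frac{1}{2\sqrt{y}}\, e^{-\sqrt{y}}\, I_{(0,\infty)}(y);$$

or, if

$$f_X(x) = \tfrac{2}{9}(x + 1)\, I_{(-1,2)}(x),$$

then

$$f_Y(y) = \left[\frac{1}{2\sqrt{y}} \cdot \frac{2}{9}(-\sqrt{y} + 1) + \frac{1}{2\sqrt{y}} \cdot \frac{2}{9}(\sqrt{y} + 1)\right] I_{(0,1)}(y) + \left[\frac{1}{2\sqrt{y}} \cdot \frac{2}{9}(1 + \sqrt{y})\right] I_{[1,4)}(y).$$

////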


5.2 Probability Integral Transform

If X is a random variable with cumulative distribution $F_X(\cdot)$, then $F_X(\cdot)$ is a
candidate for $g(\cdot)$ in the transformation $Y = g(X)$. The following theorem
gives the distribution of $Y = F_X(X)$ if $F_X(\cdot)$ is continuous. Since $F_X(\cdot)$ is a
nondecreasing function, the inverse function $F_X^{-1}(\cdot)$ may be defined for any
value of y between 0 and 1 as the smallest x satisfying $F_X(x) \ge y$.

Theorem 12 If X is a random variable with continuous cumulative distribution
function $F_X(x)$, then $U = F_X(X)$ is uniformly distributed over the
interval (0, 1). Conversely, if U is uniformly distributed over the interval
(0, 1), then $X = F_X^{-1}(U)$ has cumulative distribution function $F_X(\cdot)$.

PROOF $P[U \le u] = P[F_X(X) \le u] = P[X \le F_X^{-1}(u)] = F_X(F_X^{-1}(u)) =
u$ for $0 < u < 1$. Conversely, $P[X \le x] = P[F_X^{-1}(U) \le x] = P[U \le F_X(x)]
= F_X(x)$. ////

In various statistical applications, particularly in simulation studies, it is
often desired to generate values of some random variable X. To generate a value
of a random variable X having continuous cumulative distribution function
$F_X(\cdot)$, it suffices to generate a value of a random variable U that is uniformly
distributed over the interval (0, 1). This follows from Theorem 12 since if U is a
random variable with a uniform distribution over the interval (0, 1), then $X =
F_X^{-1}(U)$ is a random variable having distribution $F_X(\cdot)$. So to get a value,
say x, of a random variable X, obtain a value, say u, of a random variable U,
compute $F_X^{-1}(u)$, and set it equal to x. A value u of a random variable U is called
a random number. Many computer-oriented random-number generators are
available.
EXAMPLE 20 $F_X(x) = (1 - e^{-\lambda x}) I_{(0,\infty)}(x)$. $F_X^{-1}(y) = -(1/\lambda) \log_e (1 - y)$; so
$-(1/\lambda) \log_e (1 - U)$ is a random variable having distribution $(1 - e^{-\lambda x})
I_{(0,\infty)}(x)$ if U is a random variable uniformly distributed over the interval
(0, 1). ////
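The recipe of Example 20 translates directly into code: draw a random number u and return $-(1/\lambda)\log_e(1 - u)$. The sketch below uses Python's standard generator as the source of uniform random numbers; the function names, seed, and the rate 2 are illustrative assumptions.

```python
import math
import random


def exponential_quantile(u, rate):
    """F_X^{-1}(u) = -(1/rate) * log(1 - u) for the exponential distribution."""
    return -math.log(1.0 - u) / rate


def sample_exponential(rate, n_draws, seed=6):
    """Generate exponential values by applying the inverse c.d.f. to uniform random numbers."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so 1 - u is never zero and the logarithm is defined.
    return [exponential_quantile(rng.random(), rate) for _ in range(n_draws)]


values = sample_exponential(rate=2.0, n_draws=50_000)
print(sum(values) / len(values))  # should be close to the mean 1/rate = 0.5
```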


The transformation $Y = F_X(X)$ is called the probability integral transformation.
It plays an important role in the theory of distribution-free statistics
and goodness-of-fit tests.


6 TRANSFORMATIONS 

In Sec. 5 we considered the problem of obtaining the distribution of a function 
of a given random variable. It is natural to consider next the problem of 
obtaining the joint distribution of several random variables which are functions 
of a given set of random variables. 


6.1 Discrete Random Variables 

Suppose that the discrete density function $f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)$ of the n-dimensional
discrete random variable $(X_1, \ldots, X_n)$ is given. Let $\mathfrak{X}$ denote
the mass points of $(X_1, \ldots, X_n)$; that is,

$$\mathfrak{X} = \{(x_1, \ldots, x_n): f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) > 0\}.$$

Suppose that the joint density of $Y_1 = g_1(X_1, \ldots, X_n), \ldots, Y_k = g_k(X_1, \ldots, X_n)$
is desired. It can be observed that $Y_1, \ldots, Y_k$ are jointly discrete and
$P[Y_1 = y_1; \ldots; Y_k = y_k] = f_{Y_1, \ldots, Y_k}(y_1, \ldots, y_k) = \sum f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)$, where
the summation is over those $(x_1, \ldots, x_n)$ belonging to $\mathfrak{X}$ for which $(y_1, \ldots, y_k) =
(g_1(x_1, \ldots, x_n), \ldots, g_k(x_1, \ldots, x_n))$.

EXAMPLE 21 Let $(X_1, X_2, X_3)$ have a joint discrete density function given
by

$(x_1, x_2, x_3)$                    (0,0,0)  (0,0,1)  (0,1,1)  (1,0,1)  (1,1,0)  (1,1,1)

$f_{X_1,X_2,X_3}(x_1, x_2, x_3)$     1/8      3/8      1/8      1/8      1/8      1/8

Find the joint density of $Y_1 = g_1(X_1, X_2, X_3) = X_1 + X_2 + X_3$ and
$Y_2 = g_2(X_1, X_2, X_3) = |X_3 - X_2|$.

$\mathfrak{X} = \{(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)\}$.

$f_{Y_1,Y_2}(0, 0) = f_{X_1,X_2,X_3}(0, 0, 0) = 1/8$,

$f_{Y_1,Y_2}(1, 1) = f_{X_1,X_2,X_3}(0, 0, 1) = 3/8$,

$f_{Y_1,Y_2}(2, 0) = f_{X_1,X_2,X_3}(0, 1, 1) = 1/8$,

$f_{Y_1,Y_2}(2, 1) = f_{X_1,X_2,X_3}(1, 0, 1) + f_{X_1,X_2,X_3}(1, 1, 0) = 2/8$,

and

$f_{Y_1,Y_2}(3, 0) = f_{X_1,X_2,X_3}(1, 1, 1) = 1/8$. ////
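The summation over mass points in Example 21 is mechanical enough to sketch in code. The sketch below takes the table probabilities as 1/8, 3/8, 1/8, 1/8, 1/8, 1/8 (the reading of the table assumed here) and uses exact fractions; the names `f_x`, `g1`, `g2`, and `f_y` are illustrative only.

```python
from collections import defaultdict
from fractions import Fraction

# Mass points of (X1, X2, X3) with their probabilities (table values as assumed above).
f_x = {
    (0, 0, 0): Fraction(1, 8),
    (0, 0, 1): Fraction(3, 8),
    (0, 1, 1): Fraction(1, 8),
    (1, 0, 1): Fraction(1, 8),
    (1, 1, 0): Fraction(1, 8),
    (1, 1, 1): Fraction(1, 8),
}

def g1(x1, x2, x3):
    return x1 + x2 + x3        # Y1 = X1 + X2 + X3

def g2(x1, x2, x3):
    return abs(x3 - x2)        # Y2 = |X3 - X2|

# For each (y1, y2), add up f_x over the mass points that map to it.
f_y = defaultdict(Fraction)
for point, prob in f_x.items():
    f_y[(g1(*point), g2(*point))] += prob
f_y = dict(f_y)

for y, prob in sorted(f_y.items()):
    print(y, prob)
```

Only the point $(2, 1)$ collects more than one mass point, which is why it is the only value of the joint density that requires an actual sum.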

6.2 Continuous Random Variables

Suppose now that we are given the joint probability density function
$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$ of the $n$-dimensional continuous random
variable $(X_1, \ldots, X_n)$. Let

$$\mathfrak{X} = \{(x_1, \ldots, x_n): f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) > 0\}. \tag{38}$$

Again assume that the joint density of the random variables
$Y_1 = g_1(X_1, \ldots, X_n), \ldots, Y_k = g_k(X_1, \ldots, X_n)$ is desired,
where $k$ is some integer satisfying $1 \le k \le n$. If $k < n$, we will
introduce additional, new random variables
$Y_{k+1} = g_{k+1}(X_1, \ldots, X_n), \ldots, Y_n = g_n(X_1, \ldots, X_n)$ for
judiciously selected functions $g_{k+1}, \ldots, g_n$; then we will find the
joint distribution of $Y_1, \ldots, Y_n$, and finally we will find the desired
marginal distribution of $Y_1, \ldots, Y_k$ from the joint distribution of
$Y_1, \ldots, Y_n$. This use of possibly introducing additional random
variables makes the transformation
$y_1 = g_1(x_1, \ldots, x_n), \ldots, y_n = g_n(x_1, \ldots, x_n)$
a transformation from an $n$-dimensional space to an $n$-dimensional space.
Henceforth we will assume that we are seeking the joint distribution of
$Y_1 = g_1(X_1, \ldots, X_n), \ldots, Y_n = g_n(X_1, \ldots, X_n)$ (rather than
the joint distribution of $Y_1, \ldots, Y_k$) when we have given the joint
probability density of $X_1, \ldots, X_n$.

We will state our results first for $n = 2$ and later generalize to $n > 2$.
Let $f_{X_1,X_2}(x_1, x_2)$ be given. Set
$\mathfrak{X} = \{(x_1, x_2): f_{X_1,X_2}(x_1, x_2) > 0\}$. We want to
find the joint distribution of $Y_1 = g_1(X_1, X_2)$ and $Y_2 = g_2(X_1, X_2)$
for known functions $g_1(\cdot, \cdot)$ and $g_2(\cdot, \cdot)$. Now suppose that
$y_1 = g_1(x_1, x_2)$ and $y_2 = g_2(x_1, x_2)$ defines a one-to-one
transformation which maps $\mathfrak{X}$ onto, say, $\mathfrak{Y}$. Then
$x_1$ and $x_2$ can be expressed in terms of $y_1$ and $y_2$; so we can write,
say, $x_1 = g_1^{-1}(y_1, y_2)$ and $x_2 = g_2^{-1}(y_1, y_2)$. Note that
$\mathfrak{X}$ is a subset of the $x_1x_2$ plane and $\mathfrak{Y}$ is a subset
of the $y_1y_2$ plane. The determinant

$$\begin{vmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} \\[2ex] \dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2} \end{vmatrix} \tag{39}$$

will be called the Jacobian of the transformation and will be denoted by $J$.
The above discussion permits us to state Theorem 13.


Theorem 13 Let $X_1$ and $X_2$ be jointly continuous random variables
with density function $f_{X_1,X_2}(x_1, x_2)$. Set
$\mathfrak{X} = \{(x_1, x_2): f_{X_1,X_2}(x_1, x_2) > 0\}$.
Assume that:

(i) $y_1 = g_1(x_1, x_2)$ and $y_2 = g_2(x_1, x_2)$ defines a one-to-one
transformation of $\mathfrak{X}$ onto $\mathfrak{Y}$.

(ii) The first partial derivatives of $x_1 = g_1^{-1}(y_1, y_2)$ and
$x_2 = g_2^{-1}(y_1, y_2)$ are continuous over $\mathfrak{Y}$.

(iii) The Jacobian of the transformation is nonzero for
$(y_1, y_2) \in \mathfrak{Y}$.

Then the joint density of $Y_1 = g_1(X_1, X_2)$ and $Y_2 = g_2(X_1, X_2)$ is
given by

$$f_{Y_1,Y_2}(y_1, y_2) = |J|\, f_{X_1,X_2}\big(g_1^{-1}(y_1, y_2),\, g_2^{-1}(y_1, y_2)\big)\, I_{\mathfrak{Y}}(y_1, y_2). \tag{40}$$

PROOF We omit the proof; it is essentially the same as the derivation
of the formulas for transforming variables in double integrals, which may
be found in many advanced calculus textbooks. $\mathfrak{Y}$ is that subset of
the $y_1y_2$ plane consisting of points $(y_1, y_2)$ for which there exists a
$(x_1, x_2) \in \mathfrak{X}$ such that
$(y_1, y_2) = (g_1(x_1, x_2), g_2(x_1, x_2))$, so that

$$I_{\mathfrak{Y}}(y_1, y_2) = I_{\mathfrak{X}}\big(g_1^{-1}(y_1, y_2),\, g_2^{-1}(y_1, y_2)\big). \qquad ////$$


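EXAMPLE 22 Suppose that $X_1$ and $X_2$ are independent random variables,
each uniformly distributed over the interval (0, 1). Then
$f_{X_1,X_2}(x_1, x_2) = I_{(0,1)}(x_1) I_{(0,1)}(x_2)$ and
$\mathfrak{X} = \{(x_1, x_2): 0 < x_1 < 1 \text{ and } 0 < x_2 < 1\}$. Let
$y_1 = g_1(x_1, x_2) = x_1 + x_2$ and $y_2 = g_2(x_1, x_2) = x_2 - x_1$; then
$x_1 = \tfrac{1}{2}(y_1 - y_2) = g_1^{-1}(y_1, y_2)$ and
$x_2 = \tfrac{1}{2}(y_1 + y_2) = g_2^{-1}(y_1, y_2)$.

$$J = \begin{vmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} \\[2ex] \dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2} \end{vmatrix} = \begin{vmatrix} \tfrac{1}{2} & -\tfrac{1}{2} \\[1ex] \tfrac{1}{2} & \tfrac{1}{2} \end{vmatrix} = \tfrac{1}{2}.$$

FIGURE 3

$\mathfrak{X}$ and $\mathfrak{Y}$ are sketched in Fig. 3. Note that the boundary
$x_1 = 0$ of $\mathfrak{X}$ goes into the boundary $\tfrac{1}{2}(y_1 - y_2) = 0$
of $\mathfrak{Y}$, the boundary $x_2 = 0$ of $\mathfrak{X}$ goes into the
boundary $\tfrac{1}{2}(y_1 + y_2) = 0$ of $\mathfrak{Y}$, the boundary $x_1 = 1$
of $\mathfrak{X}$ goes into the boundary $\tfrac{1}{2}(y_1 - y_2) = 1$ of
$\mathfrak{Y}$, and the boundary $x_2 = 1$ of $\mathfrak{X}$ goes into the
boundary $\tfrac{1}{2}(y_1 + y_2) = 1$ of $\mathfrak{Y}$. Now the transformation
is one-to-one, the first partial derivatives of $g_1^{-1}$ and $g_2^{-1}$ are
continuous, and the Jacobian is nonzero; so

$$f_{Y_1,Y_2}(y_1, y_2) = |J|\, f_{X_1,X_2}\big(g_1^{-1}(y_1, y_2),\, g_2^{-1}(y_1, y_2)\big) = \begin{cases} \tfrac{1}{2} & \text{for } (y_1, y_2) \in \mathfrak{Y} \\ 0 & \text{otherwise.} \end{cases} \qquad ////$$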
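Formula (40) can also be evaluated numerically. The sketch below is only an illustration, assuming the transformation of Example 22 is $y_1 = x_1 + x_2$, $y_2 = x_2 - x_1$; the central-difference Jacobian and the helper names `jacobian_det`, `inverse`, `f_x`, and `f_y` are choices made here, not part of the text.

```python
def jacobian_det(inverse, y1, y2, h=1e-6):
    """Central-difference estimate of det[d(x1, x2)/d(y1, y2)] for x = inverse(y)."""
    xp1, xm1 = inverse(y1 + h, y2), inverse(y1 - h, y2)
    xp2, xm2 = inverse(y1, y2 + h), inverse(y1, y2 - h)
    dx1_dy1 = (xp1[0] - xm1[0]) / (2 * h)
    dx2_dy1 = (xp1[1] - xm1[1]) / (2 * h)
    dx1_dy2 = (xp2[0] - xm2[0]) / (2 * h)
    dx2_dy2 = (xp2[1] - xm2[1]) / (2 * h)
    return dx1_dy1 * dx2_dy2 - dx1_dy2 * dx2_dy1

def f_x(x1, x2):
    """Joint density of two independent uniform (0, 1) random variables."""
    return 1.0 if (0 < x1 < 1 and 0 < x2 < 1) else 0.0

def inverse(y1, y2):
    """Inverse transformation: x1 = (y1 - y2)/2, x2 = (y1 + y2)/2."""
    return ((y1 - y2) / 2, (y1 + y2) / 2)

def f_y(y1, y2):
    """|J| * f_x(inverse(y)); f_x vanishes off its support, which plays the role of the indicator."""
    x1, x2 = inverse(y1, y2)
    return abs(jacobian_det(inverse, y1, y2)) * f_x(x1, x2)

print(f_y(1.0, 0.0))   # a point inside the image region
print(f_y(0.5, 1.0))   # a point outside the image region
```

Because $f_X$ is zero outside its support, evaluating it at the inverse image automatically supplies the indicator of the image region, so the region never has to be described explicitly.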

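EXAMPLE 23 Let $X_1$ and $X_2$ be two independent standard normal random
variables. Let $Y_1 = X_1 + X_2$ and $Y_2 = X_1/X_2$. Then

$$x_1 = g_1^{-1}(y_1, y_2) = \frac{y_1 y_2}{1 + y_2} \quad\text{and}\quad x_2 = g_2^{-1}(y_1, y_2) = \frac{y_1}{1 + y_2},$$

$$J = \begin{vmatrix} \dfrac{y_2}{1 + y_2} & \dfrac{y_1}{(1 + y_2)^2} \\[2ex] \dfrac{1}{1 + y_2} & -\dfrac{y_1}{(1 + y_2)^2} \end{vmatrix} = -\frac{y_1 (y_2 + 1)}{(1 + y_2)^3} = -\frac{y_1}{(1 + y_2)^2},$$

and

$$f_{Y_1,Y_2}(y_1, y_2) = \frac{|y_1|}{(1 + y_2)^2}\, \frac{1}{2\pi} \exp\left\{-\frac{1}{2}\left[\frac{y_1^2 y_2^2}{(1 + y_2)^2} + \frac{y_1^2}{(1 + y_2)^2}\right]\right\} = \frac{1}{2\pi}\, \frac{|y_1|}{(1 + y_2)^2} \exp\left[-\frac{1}{2}\, \frac{(1 + y_2^2)\, y_1^2}{(1 + y_2)^2}\right].$$

To find the marginal distribution of, say, $Y_2$, we must integrate out $y_1$;
that is,

$$f_{Y_2}(y_2) = \int_{-\infty}^{\infty} f_{Y_1,Y_2}(y_1, y_2)\, dy_1 = \frac{1}{2\pi}\, \frac{1}{(1 + y_2)^2} \int_{-\infty}^{\infty} |y_1| \exp\left[-\frac{1}{2}\, \frac{(1 + y_2^2)}{(1 + y_2)^2}\, y_1^2\right] dy_1.$$

Let

$$u = \frac{1}{2}\, \frac{(1 + y_2^2)}{(1 + y_2)^2}\, y_1^2; \quad\text{then}\quad du = \frac{(1 + y_2^2)}{(1 + y_2)^2}\, y_1\, dy_1,$$

and so

$$f_{Y_2}(y_2) = \frac{1}{2\pi}\, \frac{1}{(1 + y_2)^2}\, \frac{(1 + y_2)^2}{1 + y_2^2}\; 2 \int_0^{\infty} e^{-u}\, du = \frac{1}{\pi (1 + y_2^2)},$$

a Cauchy density. That is, the ratio of two independent standard normal
random variables has a Cauchy distribution. ////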
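The conclusion of Example 23 is easy to check by simulation. The following sketch is illustrative only (the seed, the sample size, and the comparison point 1.0 are arbitrary choices): it compares the empirical proportion of ratios not exceeding 1 with the Cauchy c.d.f. $\tfrac{1}{2} + \pi^{-1}\arctan y$ evaluated at the same point.

```python
import math
import random

rng = random.Random(1)                  # fixed seed, arbitrary
n = 200_000
# Ratio of two independent standard normal draws.
ratios = [rng.gauss(0.0, 1.0) / rng.gauss(0.0, 1.0) for _ in range(n)]

def cauchy_cdf(y):
    """C.d.f. belonging to the density 1 / (pi * (1 + y^2))."""
    return 0.5 + math.atan(y) / math.pi

empirical = sum(r <= 1.0 for r in ratios) / n    # observed P[Y2 <= 1]
theoretical = cauchy_cdf(1.0)                    # 1/2 + (pi/4)/pi = 3/4
print(empirical, theoretical)
```

Comparing proportions rather than sample means is deliberate here, since the Cauchy distribution has no mean for a sample average to settle on.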


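EXAMPLE 24 Let $X_i$ have a gamma density with parameters $n_i$ and $\lambda$
for $i = 1, 2$. Assume that $X_1$ and $X_2$ are independent. Again, we seek the
joint distribution of $Y_1 = X_1 + X_2$ and $Y_2 = X_1/X_2$.

$$x_1 = g_1^{-1}(y_1, y_2) = \frac{y_1 y_2}{1 + y_2} \quad\text{and}\quad x_2 = g_2^{-1}(y_1, y_2) = \frac{y_1}{1 + y_2};$$

hence

$$|J| = \frac{y_1}{(1 + y_2)^2},$$

$$f_{Y_1,Y_2}(y_1, y_2) = \frac{y_1}{(1 + y_2)^2}\, \frac{\lambda^{n_1 + n_2}}{\Gamma(n_1)\Gamma(n_2)} \left(\frac{y_1 y_2}{1 + y_2}\right)^{n_1 - 1} \left(\frac{y_1}{1 + y_2}\right)^{n_2 - 1} e^{-\lambda y_1}\, I_{(0,\infty)}(y_1)\, I_{(0,\infty)}(y_2)$$

$$= \left[\frac{\lambda^{n_1 + n_2}}{\Gamma(n_1 + n_2)}\, y_1^{n_1 + n_2 - 1} e^{-\lambda y_1}\, I_{(0,\infty)}(y_1)\right] \left[\frac{1}{B(n_1, n_2)}\, \frac{y_2^{n_1 - 1}}{(1 + y_2)^{n_1 + n_2}}\, I_{(0,\infty)}(y_2)\right].$$

We see that $f_{Y_1,Y_2}(y_1, y_2) = f_{Y_1}(y_1) f_{Y_2}(y_2)$; so $Y_1$ and
$Y_2$ are independent. Also we see that the distribution of $Y_1 = X_1 + X_2$
is a gamma distribution with parameters $n_1 + n_2$ and $\lambda$. If
$n_1 = n_2 = 1$, then $Y_2$ is the ratio of two independent exponentially
distributed random variables and has density

$$f_{Y_2}(y_2) = \frac{1}{(1 + y_2)^2}\, I_{(0,\infty)}(y_2),$$

a density which has an infinite mean. ////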


EXAMPLE 25 Let $X_i$ have a gamma distribution with parameters $n_i$
and $\lambda$ for $i = 1, 2$, and assume $X_1$ and $X_2$ are independent.
Suppose now that the distribution of $Y_1 = X_1/(X_1 + X_2)$ is desired. We
have only the one function $y_1 = g_1(x_1, x_2) = x_1/(x_1 + x_2)$; so we have
to select the other to use the transformation technique. Since $x_1$ and $x_2$
occur in the exponent of their joint density as their sum, $x_1 + x_2$ is a good
choice. Let $y_2 = x_1 + x_2$; then $x_1 = y_1 y_2$, $x_2 = y_2 - y_1 y_2$, and

$$J = \begin{vmatrix} y_2 & y_1 \\ -y_2 & 1 - y_1 \end{vmatrix} = y_2.$$

Hence

$$f_{Y_1,Y_2}(y_1, y_2) = y_2\, \frac{\lambda^{n_1 + n_2}}{\Gamma(n_1)\Gamma(n_2)}\, (y_1 y_2)^{n_1 - 1} (y_2 - y_1 y_2)^{n_2 - 1} e^{-\lambda y_2}\, I_{(0,1)}(y_1)\, I_{(0,\infty)}(y_2)$$

$$= \left[\frac{1}{B(n_1, n_2)}\, y_1^{n_1 - 1} (1 - y_1)^{n_2 - 1}\, I_{(0,1)}(y_1)\right] \left[\frac{\lambda^{n_1 + n_2}}{\Gamma(n_1 + n_2)}\, y_2^{n_1 + n_2 - 1} e^{-\lambda y_2}\, I_{(0,\infty)}(y_2)\right].$$

It turns out that $Y_1$ and $Y_2$ are independent and $Y_1$ has a beta
distribution with parameters $n_1$ and $n_2$. ////
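A simulation offers a rough check on Example 25. The parameter values $n_1 = 2$, $n_2 = 3$, $\lambda = 1.5$ and the seed below are hypothetical. Note that Python's `random.gammavariate(alpha, beta)` takes a scale parameter, so the scale is set to $1/\lambda$.

```python
import random

rng = random.Random(2)                  # fixed seed, arbitrary
n1, n2, lam = 2.0, 3.0, 1.5             # hypothetical gamma parameters
n = 100_000

y1, y2 = [], []
for _ in range(n):
    x1 = rng.gammavariate(n1, 1.0 / lam)    # second argument is the scale 1/lam
    x2 = rng.gammavariate(n2, 1.0 / lam)
    y1.append(x1 / (x1 + x2))               # Y1 = X1 / (X1 + X2)
    y2.append(x1 + x2)                      # Y2 = X1 + X2

mean_y1 = sum(y1) / n                   # beta mean is n1 / (n1 + n2)
mean_y2 = sum(y2) / n                   # gamma mean is (n1 + n2) / lam
cov = sum(a * b for a, b in zip(y1, y2)) / n - mean_y1 * mean_y2
print(mean_y1, mean_y2, cov)
```

A sample covariance near zero is consistent with the independence of $Y_1$ and $Y_2$, though zero covariance alone would not prove it; the factorization of the joint density is what establishes independence.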


Of the three conditions that are imposed on the transformation
$y_1 = g_1(x_1, x_2)$ and $y_2 = g_2(x_1, x_2)$, the sometimes restrictive
condition that the transformation be one-to-one can be relaxed. For the
transformation $y_1 = g_1(x_1, x_2)$ and $y_2 = g_2(x_1, x_2)$, each point in
$\mathfrak{X}$ will correspond to just one point in $\mathfrak{Y}$; but to a
point in $\mathfrak{Y}$ there may correspond more than one point in
$\mathfrak{X}$, which says that the transformation is not one-to-one and
consequently Theorem 13 as stated is not applicable. If, however,
$\mathfrak{X}$ can be decomposed into a finite number of disjoint sets, say
$\mathfrak{X}_1, \ldots, \mathfrak{X}_m$, so that $y_1 = g_1(x_1, x_2)$ and
$y_2 = g_2(x_1, x_2)$ define a one-to-one transformation of each
$\mathfrak{X}_i$ onto $\mathfrak{Y}$, then the joint density of
$Y_1 = g_1(X_1, X_2)$ and $Y_2 = g_2(X_1, X_2)$ can be found. Let
$x_1 = g_{1i}^{-1}(y_1, y_2)$ and $x_2 = g_{2i}^{-1}(y_1, y_2)$ denote the
inverse transformation of $\mathfrak{Y}$ onto $\mathfrak{X}_i$ for
$i = 1, \ldots, m$, and set

$$J_i = \begin{vmatrix} \dfrac{\partial g_{1i}^{-1}}{\partial y_1} & \dfrac{\partial g_{1i}^{-1}}{\partial y_2} \\[2ex] \dfrac{\partial g_{2i}^{-1}}{\partial y_1} & \dfrac{\partial g_{2i}^{-1}}{\partial y_2} \end{vmatrix}.$$

Theorem 14 Let $X_1$ and $X_2$ be two jointly continuous random variables
with density function $f_{X_1,X_2}(x_1, x_2)$. Assume that $\mathfrak{X}$ can be
decomposed into sets $\mathfrak{X}_1, \ldots, \mathfrak{X}_m$ such that the
transformation $y_1 = g_1(x_1, x_2)$ and $y_2 = g_2(x_1, x_2)$ is one-to-one
from $\mathfrak{X}_i$ onto $\mathfrak{Y}$. Let $x_1 = g_{1i}^{-1}(y_1, y_2)$ and
$x_2 = g_{2i}^{-1}(y_1, y_2)$ denote the inverse transformation of
$\mathfrak{Y}$ onto $\mathfrak{X}_i$, $i = 1, \ldots, m$. Assume that all first
partial derivatives of $g_{1i}^{-1}$ and $g_{2i}^{-1}$ are continuous on
$\mathfrak{Y}$ and that $J_i$ does not vanish on $\mathfrak{Y}$,
$i = 1, \ldots, m$. Then

$$f_{Y_1,Y_2}(y_1, y_2) = \sum_{i=1}^{m} |J_i|\, f_{X_1,X_2}\big(g_{1i}^{-1}(y_1, y_2),\, g_{2i}^{-1}(y_1, y_2)\big)\, I_{\mathfrak{Y}}(y_1, y_2). \tag{41}$$

We illustrate this theorem with Example 26.


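EXAMPLE 26 Assume that $X_1$ and $X_2$ are independent standard normal
random variables. Consider the transformation $y_1 = x_1^2 + x_2^2$ and
$y_2 = x_2$, which implies $x_1 = \pm\sqrt{y_1 - y_2^2}$ and $x_2 = y_2$, so that
the transformation is not one-to-one. Here
$\mathfrak{X} = \{(x_1, x_2): -\infty < x_1 < \infty, -\infty < x_2 < \infty\}$,
and
$\mathfrak{Y} = \{(y_1, y_2): 0 \le y_1 < \infty, -\sqrt{y_1} < y_2 < \sqrt{y_1}\}$.
If $\mathfrak{X}$ is decomposed into $\mathfrak{X}_1$ and $\mathfrak{X}_2$, where
$\mathfrak{X}_1 = \{(x_1, x_2): 0 \le x_1 < \infty, -\infty < x_2 < \infty\}$ and
$\mathfrak{X}_2 = \{(x_1, x_2): -\infty < x_1 < 0, -\infty < x_2 < \infty\}$ (in
the terminology of Theorem 14, $m = 2$), then our transformation is one-to-one
for $\mathfrak{X}_i$ onto $\mathfrak{Y}$, $i = 1, 2$.
$g_{11}^{-1}(y_1, y_2) = \sqrt{y_1 - y_2^2}$ and $g_{21}^{-1}(y_1, y_2) = y_2$;
$g_{12}^{-1}(y_1, y_2) = -\sqrt{y_1 - y_2^2}$ and $g_{22}^{-1}(y_1, y_2) = y_2$; so

$$J_1 = \begin{vmatrix} \tfrac{1}{2}(y_1 - y_2^2)^{-1/2} & -y_2 (y_1 - y_2^2)^{-1/2} \\[1ex] 0 & 1 \end{vmatrix} = \tfrac{1}{2}(y_1 - y_2^2)^{-1/2} \quad\text{and}\quad J_2 = -\tfrac{1}{2}(y_1 - y_2^2)^{-1/2}.$$

Hence

$$f_{Y_1,Y_2}(y_1, y_2) = \Big[|J_1|\, f_{X_1,X_2}\big(g_{11}^{-1}(y_1, y_2), g_{21}^{-1}(y_1, y_2)\big) + |J_2|\, f_{X_1,X_2}\big(g_{12}^{-1}(y_1, y_2), g_{22}^{-1}(y_1, y_2)\big)\Big] I_{\mathfrak{Y}}(y_1, y_2)$$

$$= \frac{1}{\sqrt{y_1 - y_2^2}}\, \frac{1}{2\pi}\, e^{-\frac{1}{2} y_1}$$

for $y_1 \ge 0$ and $-\sqrt{y_1} < y_2 < \sqrt{y_1}$. Now

$$f_{Y_1}(y_1) = \int_{-\sqrt{y_1}}^{\sqrt{y_1}} f_{Y_1,Y_2}(y_1, y_2)\, dy_2 = \frac{1}{2\pi}\, e^{-\frac{1}{2} y_1} \int_{-\sqrt{y_1}}^{\sqrt{y_1}} \frac{dy_2}{\sqrt{y_1 - y_2^2}}$$

$$= \frac{1}{2\pi}\, e^{-\frac{1}{2} y_1} \left(\frac{\pi}{2} + \frac{\pi}{2}\right) = \frac{1}{2}\, e^{-\frac{1}{2} y_1} \quad\text{for } y_1 > 0,$$

an exponential distribution. ////

Theorems 13 and 14 can be generalized from $n = 2$ to $n > 2$. We will
state the generalization of Theorem 14. (Theorem 13 is a special case of
Theorem 14.)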


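Theorem 15 Let $X_1, X_2, \ldots, X_n$ be jointly continuous random variables
with density function $f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$. Let
$\mathfrak{X} = \{(x_1, \ldots, x_n): f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) > 0\}$.
Assume that $\mathfrak{X}$ can be decomposed into sets
$\mathfrak{X}_1, \ldots, \mathfrak{X}_m$ such that
$y_1 = g_1(x_1, \ldots, x_n)$, $y_2 = g_2(x_1, \ldots, x_n)$, $\ldots$,
$y_n = g_n(x_1, \ldots, x_n)$ is a one-to-one transformation of $\mathfrak{X}_i$
onto $\mathfrak{Y}$, $i = 1, \ldots, m$. Let
$x_1 = g_{1i}^{-1}(y_1, \ldots, y_n), \ldots, x_n = g_{ni}^{-1}(y_1, \ldots, y_n)$
denote the inverse transformation of $\mathfrak{Y}$ onto $\mathfrak{X}_i$,
$i = 1, \ldots, m$. Define

$$J_i = \begin{vmatrix} \dfrac{\partial g_{1i}^{-1}}{\partial y_1} & \dfrac{\partial g_{1i}^{-1}}{\partial y_2} & \cdots & \dfrac{\partial g_{1i}^{-1}}{\partial y_n} \\[2ex] \dfrac{\partial g_{2i}^{-1}}{\partial y_1} & \dfrac{\partial g_{2i}^{-1}}{\partial y_2} & \cdots & \dfrac{\partial g_{2i}^{-1}}{\partial y_n} \\[1ex] \vdots & \vdots & & \vdots \\[1ex] \dfrac{\partial g_{ni}^{-1}}{\partial y_1} & \dfrac{\partial g_{ni}^{-1}}{\partial y_2} & \cdots & \dfrac{\partial g_{ni}^{-1}}{\partial y_n} \end{vmatrix}$$

for $i = 1, \ldots, m$.

Assume that all the partial derivatives in $J_i$ are continuous over
$\mathfrak{Y}$ and the determinant $J_i$ is nonzero, $i = 1, \ldots, m$. Then

$$f_{Y_1,\ldots,Y_n}(y_1, \ldots, y_n) = \sum_{i=1}^{m} |J_i|\, f_{X_1,\ldots,X_n}\big(g_{1i}^{-1}(y_1, \ldots, y_n), \ldots, g_{ni}^{-1}(y_1, \ldots, y_n)\big) \tag{42}$$

for $(y_1, \ldots, y_n)$ in $\mathfrak{Y}$. ////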


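EXAMPLE 27 Let $X_1$, $X_2$, and $X_3$ be independent standard normal random
variables, $y_1 = x_1$, $y_2 = (x_1 + x_2)/2$, and $y_3 = (x_1 + x_2 + x_3)/3$.
Then $x_1 = y_1$, $x_2 = 2y_2 - y_1$, and $x_3 = 3y_3 - 2y_2$; so the
transformation is one-to-one. ($m = 1$ in Theorem 15.)

$$J = \begin{vmatrix} 1 & 0 & 0 \\ -1 & 2 & 0 \\ 0 & -2 & 3 \end{vmatrix} = 6.$$

$$f_{Y_1,Y_2,Y_3}(y_1, y_2, y_3) = 6 \left(\frac{1}{\sqrt{2\pi}}\right)^3 \exp\left\{-\tfrac{1}{2}\left[y_1^2 + (2y_2 - y_1)^2 + (3y_3 - 2y_2)^2\right]\right\}$$

$$= 6 \left(\frac{1}{\sqrt{2\pi}}\right)^3 \exp\left[-\tfrac{1}{2}\left(2y_1^2 - 4y_1y_2 + 8y_2^2 - 12y_2y_3 + 9y_3^2\right)\right].$$

The marginal distributions can be obtained from the joint distribution;
for instance,

$$f_{Y_3}(y_3) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{Y_1,Y_2,Y_3}(y_1, y_2, y_3)\, dy_1\, dy_2$$

$$= 6 \left(\frac{1}{\sqrt{2\pi}}\right)^2 \int_{-\infty}^{\infty} \exp\left[-\tfrac{1}{2}\left(6y_2^2 - 12y_2y_3 + 9y_3^2\right)\right] \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left[-\tfrac{1}{2}\left(2y_1^2 - 4y_1y_2 + 2y_2^2\right)\right] dy_1\, dy_2$$

$$= \frac{6}{\sqrt{2}}\, \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp\left[-\tfrac{1}{2}\left(6y_2^2 - 12y_2y_3 + 6y_3^2\right)\right] \exp\left[-\tfrac{1}{2}\left(3y_3^2\right)\right] dy_2$$

$$= \frac{\sqrt{3}}{\sqrt{2\pi}} \exp\left[-\tfrac{3}{2}\, y_3^2\right];$$

that is, $Y_3$ is normally distributed with mean 0 and variance $\tfrac{1}{3}$. ////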
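Two pieces of Example 27 lend themselves to a quick numerical check: the value of the Jacobian and the variance of $Y_3$. The sketch below is illustrative only; the helper `det3`, the seed, and the sample size are choices made here.

```python
import random

def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Row j holds the partial derivatives of x_j with respect to y1, y2, y3,
# where x1 = y1, x2 = 2*y2 - y1, x3 = 3*y3 - 2*y2.
jacobian = [[1, 0, 0], [-1, 2, 0], [0, -2, 3]]
j = det3(jacobian)

rng = random.Random(3)                  # fixed seed, arbitrary
n = 100_000
# Y3 is the average of three independent standard normal draws.
y3 = [(rng.gauss(0, 1) + rng.gauss(0, 1) + rng.gauss(0, 1)) / 3 for _ in range(n)]
mean_y3 = sum(y3) / n
var_y3 = sum((v - mean_y3) ** 2 for v in y3) / (n - 1)
print(j, mean_y3, var_y3)
```

The simulated variance should sit near $\tfrac{1}{3}$, in agreement with the marginal density obtained by integrating out $y_1$ and $y_2$.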


PROBLEMS 

1 (a) Let $X_1$, $X_2$, and $X_3$ be uncorrelated random variables with common
variance $\sigma^2$. Find the correlation coefficient between $X_1 + X_2$ and
$X_2 + X_3$.

(b) Let $X_1$ and $X_2$ be uncorrelated random variables. Find the correlation
coefficient between $X_1 + X_2$ and $X_2 - X_1$ in terms of var $[X_1]$ and
var $[X_2]$.

(c) Let $X_1$, $X_2$, and $X_3$ be independently distributed random variables
with common mean $\mu$ and common variance $\sigma^2$. Find the correlation
coefficient between $X_2 - X_1$ and $X_3 - X_1$.

2 Prove Theorem 2.

3 Let $X$ have c.d.f. $F_X(\cdot) = F(\cdot)$. What in terms of $F(\cdot)$ is
the distribution of $X I_{[0,\infty)}(X) = \max[0, X]$?

4 Consider drawing balls, one at a time, without replacement, from an urn
containing $M$ balls, $K$ of which are defective. Let the random variable
$X$ ($Y$) denote the number of the draw on which the first defective
(nondefective) ball is obtained. Let $Z$ denote the number of the draw on which
the $r$th defective ball is obtained.

(a) Find the distribution of $X$.

(b) Find the distribution of $Z$. (Such distribution is often called the
negative hypergeometric distribution.)

(c) Set $M = 5$ and $K = 2$. Find the joint distribution of $X$ and $Y$.

5 Let $X_1, \ldots, X_n$ be independent and identically distributed with common
density

$f_X(x) = x^{-2} I_{[1,\infty)}(x)$.

Set $Y = \min[X_1, \ldots, X_n]$. Does $\mathcal{E}[X_1]$ exist? If so, find it.
Does $\mathcal{E}[Y]$ exist? If so, find it.

6 Let $X$ and $Y$ be two random variables having finite means.

(a) Prove or disprove: $\mathcal{E}[\max[X, Y]] \ge \max[\mathcal{E}[X], \mathcal{E}[Y]]$.

(b) Prove or disprove: $\mathcal{E}[\max[X, Y] + \min[X, Y]] = \mathcal{E}[X] + \mathcal{E}[Y]$.

7 The area of a rectangle is obtained by first measuring the length and width
and then multiplying the two measurements together. Let $X$ denote the measured
length, $Y$ the measured width. Assume that the measurements $X$ and $Y$ are
random variables with joint probability density function given by
$f_{X,Y}(x, y) = k I_{[.9L, 1.1L]}(x) I_{[.9W, 1.1W]}(y)$, where $L$ and $W$ are
parameters satisfying $L \ge W > 0$ and $k$ is a constant which may depend on
$L$ and $W$.

(a) Find $\mathcal{E}[XY]$ and var $[XY]$.

(b) Find the distribution of $XY$.

8 If $X$ and $Y$ are independent random variables with (negative) exponential
distributions having respective parameters $\lambda_1$ and $\lambda_2$, find
$\mathcal{E}[\max[X, Y]]$.

9 Projectiles are fired at the origin of an $xy$ coordinate system. Assume that
the point which is hit, say $(X, Y)$, consists of a pair of independent standard
normal random variables. For two projectiles fired independently of one another,
let $(X_1, Y_1)$ and $(X_2, Y_2)$ represent the points which are hit, and let
$Z$ be the distance between them. Find the distribution of $Z^2$. Hint: What is
the distribution of $(X_2 - X_1)^2$? Of $(Y_2 - Y_1)^2$? Is $(X_2 - X_1)^2$
independent of $(Y_2 - Y_1)^2$?

10 A certain explosive device will detonate if any one of $n$ short-lived fuses
lasts longer than .8 seconds. Let $X_i$ represent the life of the $i$th fuse.
It can be assumed that each $X_i$ is uniformly distributed over the interval 0
to 1 second. Furthermore, it can be assumed that the $X_i$'s are independent.

(a) How many fuses are needed (i.e., how large should $n$ be) if one wants to be
95 percent certain that the device will detonate?

(b) If the device has nine fuses, what is the average life of the fuse that
lasts the longest?

11 Suppose that random variable $X_n$ has a c.d.f. given by
$[(n - 1)/n]\Phi(x) + (1/n)F_n(x)$, where $\Phi(\cdot)$ is the c.d.f. of a
standard normal and for each $n$ $F_n(\cdot)$ is a c.d.f. What is the limiting
distribution of $X_n$?

12 Let $X$ and $Y$ be independent random variables each having a geometric
distribution.

*(a) Find the distribution of $X/(X + Y)$. [Define $X/(X + Y)$ to be zero if
$X + Y = 0$.]

(b) Find the joint moment generating function of $X$ and $X + Y$.

13 Let $X_1$ and $X_2$ be independent standard normal random variables. Let
$Y_1 = X_1 + X_2$ and $Y_2 = X_1^2 + X_2^2$.

(a) Show that the joint moment generating function of $Y_1$ and $Y_2$ is

$$\frac{\exp[t_1^2/(1 - 2t_2)]}{1 - 2t_2}$$

for $-\infty < t_1 < \infty$ and $-\infty < t_2 < \tfrac{1}{2}$.

(b) Find the correlation coefficient of $Y_1$ and $Y_2$.

14 Let $X$ and $Y$ be independent standard normal random variables. Find the
m.g.f. of $XY$.

15 Suppose that $X_1$ and $X_2$ are independent random variables, each having a
standard normal distribution.

(a) Find the joint distribution of $(X_1 + X_2)/\sqrt{2}$ and
$(X_2 - X_1)/\sqrt{2}$.

(b) Argue that $2X_1X_2$ and $X_2^2 - X_1^2$ have the same distribution. Hint:

$$X_2^2 - X_1^2 = 2\, \frac{X_2 + X_1}{\sqrt{2}}\, \frac{X_2 - X_1}{\sqrt{2}}.$$


16 A dry bean supplier fills bean bags with a machine that does not work very
well, and he advertises that each bag contains 1 pound of beans. In fact, the
weight of the beans that the machine puts into a bag is a random variable with
mean 16 ounces and standard deviation 1 ounce. Suppose a box contains 16 bags
of beans.

(a) Find the mean and variance of the weight of the beans in a box.

(b) Find approximately the probability that the weight of the beans in a box
exceeds 250 ounces.

(c) Find the probability that two or fewer underweight (less than 16 ounces)
bags are in the box if the weight of beans in a bag is assumed to be normally
distributed.

17 Numbers are selected at random from the interval (0, 1).

(a) If 10 numbers are selected, what is the probability that exactly 5 are less
than $\tfrac{1}{2}$?

(b) If 10 numbers are selected, on the average how many are less than
$\tfrac{1}{2}$?

(c) If 100 numbers are selected, what is the probability that the average of the
numbers is less than $\tfrac{1}{2}$?

18 Let $X_i$ denote the number of meteors that collide with a test satellite
during the $i$th orbit. Let $S_n = \sum_{i=1}^{n} X_i$; that is, $S_n$ is the
total number of meteors that collide with the satellite during $n$ orbits.
Assume that the $X_i$'s are independent and identically distributed Poisson
random variables having mean $\lambda$.

(a) Find $\mathcal{E}[S_n]$ and var $[S_n]$.

(b) If $n = 100$ and $\lambda = 4$, find approximately $P[S_{100} > 440]$.

19 How many light bulbs should you buy if you want to be 95 percent certain that
you will have 1000 hours of light if each of the bulbs is known to have a
lifetime that is (negative) exponentially distributed with an average life of
100 hours?

(a) Assume that all the bulbs are burning simultaneously.

(b) Assume that one bulb is used until it burns out and then it is replaced,
etc.

20 (a) If $X_1, \ldots, X_n$ are independent and identically distributed gamma
random variables, what is the distribution of $X_1 + \cdots + X_n$?

(b) If $X_1, \ldots, X_n$ are independent gamma random variables and if $X_i$
has parameters $r_i$ and $\lambda$, $i = 1, \ldots, n$, what is the distribution
of $X_1 + \cdots + X_n$?

21 (a) If $X_1, \ldots, X_n$ are independent identically distributed geometric
random variables, what is the distribution of $X_1 + \cdots + X_n$?

(b) If $X_1, \ldots, X_n$ are independent identically distributed geometric
random variables with density $\theta(1 - \theta)^{x-1} I_{\{1, 2, \ldots\}}(x)$,
what is the distribution of $X_1 + \cdots + X_n$?

(c) If $X_1, \ldots, X_n$ are independent identically distributed negative
binomial random variables, what is the distribution of $X_1 + \cdots + X_n$?

(d) If $X_1, \ldots, X_n$ are independent negative binomial random variables and
if $X_i$ has parameters $r_i$ and $p$, what is the distribution of
$X_1 + \cdots + X_n$?

*22 Kitty Oil Co. has decided to drill for oil in 10 different locations; the
cost of drilling at each location is $10,000. (Total cost is then $100,000.)
The probability of finding oil in a given location is only $\tfrac{1}{5}$, but
if oil is found at a given location, then the amount of money the company will
get selling oil (excluding the initial $10,000 drilling cost) from that location
is an exponential random variable with mean $50,000. Let $Y$ be the random
variable that denotes the number of locations where oil is found, and let $Z$
denote the total amount of money received from selling oil from all the
locations.

(a) Find $\mathcal{E}[Z]$.

(b) Find $P[Z > 100{,}000 \mid Y = 1]$ and $P[Z > 100{,}000 \mid Y = 2]$.

(c) How would you find $P[Z > 100{,}000]$? Is $P[Z > 100{,}000] > \tfrac{1}{2}$?

23 If $X_1, \ldots, X_k$ are independent Poisson distributed random variables,
show that the conditional distribution of $X_1$, given $X_1 + \cdots + X_k$, is
binomial.

*24 Assume that $X_1, \ldots, X_{k+1}$ are independent Poisson distributed
random variables with respective parameters $\lambda_1, \ldots, \lambda_{k+1}$.
Show that the conditional distribution of $X_1, \ldots, X_k$ given that
$X_1 + \cdots + X_{k+1} = n$ has a multinomial distribution with parameters $n$,
$\lambda_1/\lambda, \ldots, \lambda_k/\lambda$, where
$\lambda = \lambda_1 + \cdots + \lambda_{k+1}$.

25 If $X$ has a uniform distribution over the interval $(-\pi/2, \pi/2)$, find
the distribution of $Y = \tan X$.

26 If $X$ has a normal distribution with mean $\mu$ and variance $\sigma^2$,
find the distribution, mean, and variance of $Y = e^X$.

27 Suppose $X$ has c.d.f. $F_X(x) = \exp[-e^{-(x - \alpha)/\beta}]$. What is the
distribution of $Y = \exp[-(X - \alpha)/\beta]$?

28 Let $X$ have density

$$f_X(x; a, b) = \frac{1}{B(a, b)}\, \frac{x^{a-1}}{(1 + x)^{a+b}}\, I_{(0,\infty)}(x),$$

where $a > 0$ and $b > 0$. (This density is often called a beta distribution of
the second kind.) Find the distribution of $Y = 1/(1 + X)$.

29 If $X$ has a uniform distribution on the interval (0, 1), find the
distribution of $1/X$. Does $\mathcal{E}[1/X]$ exist? If so, find it.





30 (a) Give an example of a distribution of a random variable $X$ for which
$\mathcal{E}[1/X]$ is not finite.

(b) Give an example of a distribution of a random variable $X$ for which
$\mathcal{E}[1/X]$ is finite, and evaluate $\mathcal{E}[1/X]$.

31 If $f_X(x) = 2xe^{-x^2} I_{(0,\infty)}(x)$, find the density of $Y = X^2$.

32 If $X$ has a beta distribution, what is the distribution of $1 - X$?

33 If $f_X(x) = e^{-x} I_{(0,\infty)}(x)$, find the distribution of $X/(1 + X)$.

34 If $f_X(x) = 1/[\pi(1 + x^2)]$, find the distribution of $1/X$.

35 If $f_X(x) = 0$ for $x \le 0$, find the density of $Y = aX^2 + b$ in terms of
$f_X(\cdot)$ for $a > 0$.

36 If $X$ has the Weibull distribution as given in Eq. (42) of Chap. III, what
is the distribution of $Y = aX^b$?

37 (a) Let $Y = X^2$ and $f_X(x) = (1/\theta) I_{(0,\theta)}(x)$, $\theta > 0$.
Find the c.d.f. of $X$ and $Y$. Find the density of $Y$.

(b) Let $Y = X^2$ and $f_X(x) = (1/2\theta) I_{(-\theta,\theta)}(x)$,
$\theta > 0$. Find the c.d.f. and density of $Y$.

38 If $X$ and $Y$ are independent random variables, each having the same
geometric distribution, find the distribution of $Y - X$.

39 If $X$ and $Y$ are independent random variables, each having the same
negative exponential distribution, find the distribution of $Y - X$.

40 If $X$, $Y$, and $Z$ are independent random variables, each uniformly
distributed over (0, 1), what is the distribution of $XY/Z$?

41 Assume that $X$ and $Y$ are independent random variables, where $X$ has a
p.d.f. given by $f_X(x) = 2x I_{(0,1)}(x)$ and $Y$ has a p.d.f. given by
$f_Y(y) = 2(1 - y) I_{(0,1)}(y)$. Find the distribution of $X + Y$.

*42 Let $X$ and $Y$ be independent Poisson distributed random variables. Find
the distribution of $Y - X$.

43 If $f_X(x) = I_{(0,1)}(x)$, find the density of $Y = 3X + 1$.

*44 Let $X$ and $Y$ be two independent beta-distributed random variables. Is
$XY$ always beta-distributed? If not, find conditions on the parameters of $X$
and $Y$ that will imply that $XY$ is beta-distributed.

45 If $f_{X,Y}(x, y) = e^{-(x+y)} I_{(0,\infty)}(x) I_{(0,\infty)}(y)$, find the
density of $Z = (X + Y)/2$.

46 If $f_{X,Y}(x, y) = 4xye^{-(x^2+y^2)} I_{(0,\infty)}(x) I_{(0,\infty)}(y)$,
find the density of $\sqrt{X^2 + Y^2}$.

47 If $f_{X,Y}(x, y) = 4xy I_{(0,1)}(x) I_{(0,1)}(y)$, find the joint density of
$X^2$ and $Y^2$.

48 If $f_{X,Y}(x, y) = 3x I_{(0,x)}(y) I_{(0,1)}(x)$, find the density of
$Z = X - Y$.

49 If $f_X(x) = [(1 + x)/2] I_{(-1,1)}(x)$, find the density of $Y = X^2$.

50 If $f_{X,Y}(x, y) = I_{(0,1)}(x) I_{(0,1)}(y)$, find the density of $Z$, where

$Z = (X + Y) I_{(-\infty,1]}(X + Y) + (X + Y - 1) I_{(1,\infty)}(X + Y)$.

51 If $f_{X,Y}(x, y) = e^{-(x+y)} I_{(0,\infty)}(x) I_{(0,\infty)}(y)$, find the
joint density of $X$ and $X + Y$.

52 If $f_{X,Y,Z}(x, y, z) = e^{-(x+y+z)} I_{(0,\infty)}(x) I_{(0,\infty)}(y) I_{(0,\infty)}(z)$,
find the density of their average $(X + Y + Z)/3$.

53 If $X_1$ and $X_2$ are independent and each has probability density given by
$\lambda e^{-\lambda x} I_{(0,\infty)}(x)$, find the joint distribution of
$Y_1 = X_1/X_2$ and $Y_2 = X_1 + X_2$ and the marginal distributions of $Y_1$
and $Y_2$.

*54 Let $X_1$ and $X_2$ be independent random variables, each normally
distributed with parameters $\mu = 0$ and $\sigma^2 = 1$. Find the joint
distribution of $Y_1 = X_1^2 + X_2^2$ and $Y_2 = X_1/X_2$. Find the marginal
distribution of $Y_1$ and of $Y_2$. Are $Y_1$ and $Y_2$ independent?

55 If the joint distribution of $X$ and $Y$ is given by

$f_{X,Y}(x, y) = 2e^{-(x+y)} I_{[0,y]}(x) I_{[0,\infty)}(y)$,

find the joint distribution of $X$ and $X + Y$. Find the marginal distributions
of $X$ and $X + Y$.

56 Let $f_{X,Y}(x, y) = K(x + y) I_{(0,1)}(x) I_{(0,1)}(y) I_{(0,1)}(x + y)$.

(a) Find $f_X(\cdot)$.

(b) Find the joint and marginal distributions of $X + Y$ and $Y - X$.

57 Suppose $f_{X,Y|Z}(x, y \mid z) = [z + (1 - z)(x + y)] I_{(0,1)}(x) I_{(0,1)}(y)$
for $0 \le z \le 2$, and $f_Z(z) = \tfrac{1}{2} I_{[0,2]}(z)$.

(a) Find $\mathcal{E}[X + Y]$.

(b) Are $X$ and $Y$ independent? Verify.

(c) Are $X$ and $Z$ independent? Verify.

(d) Find the joint distribution of $X$ and $X + Y$.

(e) Find the distribution of $\max[X, Y] \mid Z = z$.

(f) Find the distribution of $(X + Y) \mid Z = z$.

58 A system will function as long as at least one of three components functions.
When all three components are functioning, the distribution of the life of each
is exponential with parameter $\tfrac{1}{3}\lambda$. When only two are
functioning, the distribution of the life of each of the two is exponential with
parameter $\tfrac{1}{2}\lambda$; and when only one is functioning, the
distribution of its life is exponential with parameter $\lambda$.

(a) What is the distribution of the lifetime of the system?

(b) Suppose now that only one component (of the three components) is used at a
time and it is replaced when it fails. What is the distribution of the lifetime
of such a system?

59 The system in the sketch will function as long as component $C_1$ and at
least one of the components $C_2$ and $C_3$ functions. Let $X_i$ be the random
variable denoting the lifetime of component $C_i$, $i = 1, 2,$ and 3. Let
$Y = \max[X_2, X_3]$ and $Z = \min[X_1, Y]$. Assume that the $X_i$'s are
independent (negative) exponential random variables with mean 1.

(a) Find $\mathcal{E}[Z]$ and var $[Z]$.

(b) Find the distribution of the lifetime of the system.

60 A system, which is composed of two components, will function as long as at
least one of the two components functions. When both components are operating,
the lifetime distribution of each is exponential with mean 1. However, the
distribution of the remaining lifetime of the good component after one fails is
exponential with mean $\tfrac{1}{2}$. (The idea is that after one component
fails the other component carries twice the load and hence has only half the
expected lifetime.) Find the lifetime distribution of the system.

*61 Suppose that $(X, Y)$ has a bivariate normal distribution. Find the joint
distribution of $aX + bY$ and $cX + dY$ for constants $a$, $b$, $c$, and $d$
satisfying $ad - bc \ne 0$. Find the distribution of $aX + bY$. Hint: Use the
moment-generating-function technique and see Example 7.

62 Let $X_1$ and $X_2$ be independent standard normal random variables. Let $U$
be independent of $X_1$ and $X_2$, and assume that $U$ is uniformly distributed
over (0, 1). Define $Z = UX_1 + (1 - U)X_2$.

(a) Find the conditional distribution of $Z$ given $U = u$.

(b) Find $\mathcal{E}[Z]$ and var $[Z]$.

*(c) Find the distribution of $Z$.



VI 

SAMPLING AND SAMPLING DISTRIBUTIONS 


1 INTRODUCTION AND SUMMARY 

The purpose of this chapter is to introduce the concept of sampling and to present 
some distribution theoretical results that are engendered by sampling. It is a 
connecting chapter — it merges the distribution theory of the first five chapters 
into the statistical theory of the last five chapters. The intent is to present 
here in one location some of the laborious derivations of distributions that are 
associated with sampling and that will be necessary in our future study of the 
theory of statistics, especially estimation and testing hypotheses. Our thinking 
is that by deriving these results now, our later presentation of the statistical 
theory will not have to be interrupted by their derivations. The nature of the 
material to be given here is such that it is not easily motivated. 

Section 2 begins with a discussion of populations and samples. It ends
with the definitions of statistic and of sample moments. Sample moments are
important and useful statistics. Section 3 is devoted to the consideration of
various results associated with the sample mean. The law of large numbers
and the central-limit theorem are given, and then the exact distribution of the
sample means from several of the different parametric families of distributions
introduced in Chap. III is given. Sampling from the normal distribution is
considered in Sec. 4, where the chi-square, F, and t distributions are defined.
Order statistics are discussed in the final section; they, like sample moments,
are important and useful statistics.


2 SAMPLING

2.1 Inductive Inference

Up to now we have been concerned with certain aspects of the theory of
probability, including distribution theory. Now the subject of sampling brings
us to the theory of statistics proper, and here we shall consider briefly one
important area of the theory of statistics and its relation to sampling.

Progress in science is often ascribed to experimentation. The research
worker performs an experiment and obtains some data. On the basis of the
data, certain conclusions are drawn. The conclusions usually go beyond the
materials and operations of the particular experiment. In other words, the
scientist may generalize from a particular experiment to the class of all similar
experiments. This sort of extension from the particular to the general is called
inductive inference. It is one way in which new knowledge is found.

Inductive inference is well known to be a hazardous process. In fact, it
is a theorem of logic that in inductive inference uncertainty is present. One
simply cannot make absolutely certain generalizations. However, uncertain
inferences can be made, and the degree of uncertainty can be measured if the
experiment has been performed in accordance with certain principles. One
function of statistics is the provision of techniques for making inductive
inferences and for measuring the degree of uncertainty of such inferences.
Uncertainty is measured in terms of probability, and that is the reason we have
devoted so much time to the theory of probability.

Before proceeding further we shall say a few words about another kind of
inference, deductive inference. While conclusions which are reached by
inductive inference are only probable, those reached by deductive inference are
conclusive. To illustrate deductive inference, consider the following two
statements:

(i) One of the interior angles of each right triangle equals 90°.

(ii) Triangle A is a right triangle.

If we accept these two statements, then we are forced to the conclusion:

(iii) One of the angles of triangle A equals 90°.

This is an example of deductive inference, which can be described as a
method of deriving information [statement (iii)] from accepted facts [statements
(i) and (ii)]. Statement (i) is called the major premise, statement (ii) the
minor premise, and statement (iii) the conclusion. For another example, consider
the following:

(i) Major premise: All West Point graduates are over 18 years of age. 

(ii) Minor premise: John is a West Point graduate. 

(iii) Conclusion: John is over 18 years of age. 

West Point graduates is a subset of all persons over 18 years old, and John 
is an element in the subset of West Point graduates; hence John is also an element 
in the set of persons who are over 18 years old. 

While deductive inference is extremely important, much of the new knowl- 
edge in the real world comes about by the process of inductive inference. In the 
science of mathematics, for example, deductive inference is used to prove the- 
orems, while in the empirical sciences inductive inference is used to find new 
knowledge. 

Let us illustrate inductive inference by a simple example. Suppose that 
we have a storage bin which contains (let us say) 10 million flower seeds which 
we know will each produce either white or red flowers. The information which 
we want is: How many (or what percent) of these 10 million seeds will produce 
white flowers? Now the only way in which we can be sure that this question 
is answered correctly is to plant every seed and observe the number producing 
white flowers. However, this is not feasible since we want to sell the seeds. 
Even if we did not want to sell the seeds, we would prefer to obtain an answer 
without expending so much effort. Of course, without planting each seed and 
observing the color of flower that each produces we cannot be certain of the 
number of seeds producing white flowers. Another thought which occurs is: 
Can we plant a few of the seeds and, on the basis of the colors of these few 
flowers, make a statement as to how many of the 10 million seeds will produce
white flowers? The answer is that we cannot make an exact prediction as to
how many white flowers the seeds will produce, but we can make a probabilistic
statement if we select the few seeds in a certain fashion. This is inductive
inference. We select a few of the 10 million seeds, plant them, observe the
number which produce white flowers, and on the basis of these few we make a
prediction as to how many of the 10 million will produce white flowers; from a
knowledge of the color of a few we generalize to the whole 10 million. We
cannot be certain of our answer, but we can have confidence in it in a
frequency-ratio probability sense.
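
The seed example can be imitated numerically. The sketch below is only an illustration under stated assumptions: the true fraction of white-flowering seeds (0.30) is a hypothetical value that the experimenter would not know, the function name `sample_white_fraction` is our own, and drawing a few seeds from 10 million is modeled as independent draws.

```python
import random


def sample_white_fraction(true_white_fraction, sample_size, rng):
    """Plant `sample_size` randomly chosen seeds; return the fraction flowering white.

    Each sampled seed is coded 1 (white) with probability `true_white_fraction`
    and 0 (red) otherwise. Because the bin holds 10 million seeds, a small
    sample is modeled here as a sequence of independent draws.
    """
    whites = sum(1 for _ in range(sample_size) if rng.random() < true_white_fraction)
    return whites / sample_size


if __name__ == "__main__":
    rng = random.Random(1)      # fixed seed so that the run is repeatable
    true_fraction = 0.30        # hypothetical; unknown to the experimenter
    for n in (25, 400, 10_000):
        estimate = sample_white_fraction(true_fraction, n, rng)
        print(f"n = {n:6d}: estimated fraction white = {estimate:.3f}")
```

The estimate is never certain to equal the true fraction, but larger samples tend to land closer to it, which is the frequency-ratio sense of confidence referred to above.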

2.2 Populations and Samples

We have seen in the previous subsection that a central problem in discovering
new knowledge in the real world consists of observing a few of the elements
under discussion and, on the basis of these few, making a statement about the
totality of elements. We shall now investigate this procedure in more detail.

Definition 1 Target population The totality of elements which are under
discussion and about which information is desired will be called the target
population. ////

In the example in the previous subsection the 10 million seeds in the storage
bin form the target population. The target population may be all the dairy
cattle in Wisconsin on a certain date, the prices of bread in New York City on a
certain date, the hypothetical sequence of heads and tails obtained by tossing a
certain coin an infinite number of times, the hypothetical set of an infinite
number of measurements of the velocity of light, and so forth. The important
thing is that the target population must be capable of being quite well defined;
it may be real or hypothetical.

The problem of inductive inference is regarded as follows from the point
of view of statistics: The object of an investigation is to find out something
about a certain target population. It is generally impossible or impractical to
examine the entire population, but one may examine a part of it (a sample from
it) and, on the basis of this limited investigation, make inferences regarding
the entire target population.

The problem immediately arises as to how the sample of the population
should be selected. We stated in the previous section that we could make
probabilistic statements about the population if the sample is selected in a
certain fashion. Of particular importance is the case of a simple random sample,
usually called a random sample, which is defined in Definition 2 below for any
population which has a density. That is, we assume that each element in our
population has some numerical value associated with it and the distribution of
these numerical values is given by a density. For such a population we define
a random sample.

Definition 2 Random sample Let the random variables $X_1, X_2, \ldots, X_n$
have a joint density $f_{X_1, \ldots, X_n}(\cdot, \ldots, \cdot)$ that factors as follows:

$$ f_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) = f(x_1) f(x_2) \cdots f(x_n), $$

where $f(\cdot)$ is the (common) density of each $X_i$. Then $X_1, X_2, \ldots, X_n$ is
defined to be a random sample of size $n$ from a population with density
$f(\cdot)$. ////

In the example in the previous subsection the 10 million seeds in the storage
bin formed the population from which we propose to sample. Each seed is
an element of the population and will produce a white or red flower; so, strictly
speaking, there is not a numerical value associated with each element of the
population. However, if we, say, associate the number 1 with white and the
number 0 with red, then there is a numerical value associated with each element
of the population, and we can discuss whether or not a particular sample is
random. The random variable $X_i$ is then 1 or 0 depending on whether the $i$th
seed sampled produces a white or red flower, $i = 1, \ldots, n$. Now if the sampling
of seeds is performed in such a way that the random variables $X_1, \ldots, X_n$ are
independent and have the same density, then, according to Definition 2, the
sample is called random.

An important part of the definition of a random sample is the meaning of
the random variables $X_1, \ldots, X_n$. The random variable $X_i$ is a representation
for the numerical value that the $i$th item (or element) sampled will assume. After
the sample is observed, the actual values of $X_1, \ldots, X_n$ are known, and as usual,
we denote these observed values by $x_1, \ldots, x_n$. Sometimes the observations
$x_1, \ldots, x_n$ are called a random sample if $x_1, \ldots, x_n$ are the values of
$X_1, \ldots, X_n$, where $X_1, \ldots, X_n$ is a random sample.

Often it is not possible to select a random sample from the target population,
but a random sample can be selected from some related population. To
distinguish the two populations, we define sampled population.

Definition 3 Sampled population Let $X_1, X_2, \ldots, X_n$ be a random
sample from a population with density $f(\cdot)$; then this population is called
the sampled population. ////

Valid probability statements can be made about sampled populations on
the basis of random samples, but statements about the target populations are
not valid in a relative-frequency probability sense unless the target population
is also the sampled population. We shall give some examples to bring out the
distinction between the sampled population and the target population.


EXAMPLE 1 Suppose that a sociologist desires to study the religious habits
of 20-year-old males in the United States. He draws a sample from the
20-year-old males of a large city to make his study. In this case the target
population is the 20-year-old males in the United States, and the sampled
population is the 20-year-old males in the city which he sampled. He can
draw valid relative-frequency probabilistic conclusions about his sampled
population, but he must use his personal judgment to extrapolate to the
target population, and the reliability of the extrapolation cannot be
measured in relative-frequency probability terms. ////

EXAMPLE 2 A wheat researcher is studying the yield of a certain variety of
wheat in the state of Colorado. He has at his disposal five farms scattered
throughout the state on which he can plant the wheat and observe
the yield. The sampled population consists of the yields on these five
farms, whereas the target population consists of the yields of wheat on
every farm in the state. ////

This book will be concerned with the problem of selecting (drawing) a
sample from a sampled population with density $f(\cdot)$, and on the basis of these
sample observations probability statements will be made about $f(\cdot)$, or
inferences about $f(\cdot)$ will be made.


Remark We shall sometimes use the statement "population $f(\cdot)$" to
mean "a population with density $f(\cdot)$." When we use the word "population"
without an adjective "sampled" or "target," we shall always mean
sampled population. ////


2.3 Distribution of Sample

Definition 4 Distribution of sample Let $X_1, X_2, \ldots, X_n$ denote a sample
of size $n$. The distribution of the sample $X_1, \ldots, X_n$ is defined to be the
joint distribution of $X_1, \ldots, X_n$. ////

Suppose that a random variable $X$ has a density $f(\cdot)$ in some population,
and suppose a sample of two values of $X$, say $x_1$ and $x_2$, is drawn at random.
$x_1$ is called the first observation, and $x_2$ the second observation. The pair of
numbers $(x_1, x_2)$ determines a point in a plane, and the collection of all such
pairs of numbers that might have been drawn forms a bivariate population.
We are interested in the distribution (bivariate) of this bivariate population in
terms of the original density $f(\cdot)$. The pair of numbers $(x_1, x_2)$ is a value of the
joint random variable $(X_1, X_2)$, and $X_1, X_2$ is a random sample (of size 2)
from $f(\cdot)$. By definition of random sample, the joint distribution of $X_1$ and
$X_2$, which we call the distribution of our random sample of size 2, is given by

$$ f_{X_1, X_2}(x_1, x_2) = f(x_1) f(x_2). $$

As a simple example, suppose that $X$ can have only two values, 0 and 1,
with probabilities $q = 1 - p$ and $p$, respectively. That is, $X$ is a discrete
random variable which has the Bernoulli distribution

$$ f(x) = p^x q^{1-x} I_{\{0,1\}}(x). \tag{1} $$

The joint density for a random sample of two values from $f(\cdot)$ is

$$ f_{X_1, X_2}(x_1, x_2) = f(x_1) f(x_2) = p^{x_1 + x_2} q^{2 - x_1 - x_2} I_{\{0,1\}}(x_1) I_{\{0,1\}}(x_2). \tag{2} $$

It is to be observed that this (bivariate) density is not what we obtain as the
distribution of the number of successes, say $Y$, in drawing two elements from a
Bernoulli population. The density of $Y$ is given by

$$ f_Y(y) = \binom{2}{y} p^y q^{2-y} \quad \text{for } y = 0, 1, 2. $$

The single random variable $Y$ equals $X_1 + X_2$.

It should be noted that $f_{X_1, X_2}(x_1, x_2)$ gives us the distribution of the sample
in the order drawn. For instance, in Eq. (2), $f_{X_1, X_2}(0, 1) = pq$ refers to the
probability of drawing first a 0 and then a 1.
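
The distinction between the ordered-sample density of Eq. (2) and the density of the number of successes can be checked by direct computation. The following is a minimal sketch; the value p = 0.3 and the function names are assumptions made only for illustration.

```python
from itertools import product
from math import comb


def bernoulli_density(x, p):
    """f(x) = p^x q^(1-x) for x in {0, 1}, and 0 otherwise (Eq. (1))."""
    return p**x * (1 - p)**(1 - x) if x in (0, 1) else 0.0


def joint_density(x1, x2, p):
    """Eq. (2): density of the ordered sample (x1, x2) of size 2."""
    return bernoulli_density(x1, p) * bernoulli_density(x2, p)


def density_of_sum(y, p):
    """Binomial density of Y = X1 + X2, the number of successes in two draws."""
    return comb(2, y) * p**y * (1 - p)**(2 - y) if y in (0, 1, 2) else 0.0


if __name__ == "__main__":
    p = 0.3  # hypothetical value of p
    for x1, x2 in product((0, 1), repeat=2):
        print(f"f(x1={x1}, x2={x2}) = {joint_density(x1, x2, p):.2f}")
    for y in (0, 1, 2):
        print(f"f_Y({y}) = {density_of_sum(y, p):.2f}")
```

The two ordered outcomes (0, 1) and (1, 0) each have probability pq, and together they account for the probability 2pq that Y equals 1.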

Our comments for a random sample of size 2 generalize to a random sample
of size $n$, and we have the following remark.

Remark If $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ from $f(\cdot)$, then
the distribution of the random sample $X_1, \ldots, X_n$, defined as the joint
distribution of $X_1, \ldots, X_n$, is given by
$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = f(x_1) \cdots f(x_n)$. ////

Note that again this gives the distribution of the sample in the order drawn.
Also, note that if $X_1, \ldots, X_n$ is a random sample, then $X_1, \ldots, X_n$ are
stochastically independent.

We might further note that our definition of random sampling has
automatically ruled out sampling from a finite population without replacement
since, then, the results of the drawings are not independent.

2.4 Statistic and Sample Moments

One of the central problems in statistics is the following: It is desired to study a
population which has a density $f(\cdot; \theta)$, where the form of the density is known
but it contains an unknown parameter $\theta$ (if $\theta$ is known, then the density
function is completely specified). The procedure is to take a random sample
$X_1, X_2, \ldots, X_n$ of size $n$ from this density and let the value of some function,
say $t(x_1, x_2, \ldots, x_n)$, represent or estimate the unknown parameter $\theta$. The
problem is to determine which function will be the best one to estimate $\theta$. This
problem will be formulated in more detail in the next chapter. In this section
we shall examine certain functions, namely, the sample moments, of a random
sample. First, however, we shall define what we shall mean by a statistic.

Definition 5 Statistic A statistic is a function of observable random
variables, which is itself an observable random variable, which does not
contain any unknown parameters. ////

The qualification imposed by the word "observable" is required because
of the way we intend to use a statistic. ("Observable" means that we can observe
the values of the random variables.) We intend to use a statistic to make
inferences about the density of the random variables, and if the random variables
were not observable, they would be of no use in making inferences.

For example, if the observable random variable $X$ has the density $\phi_{\mu, \sigma^2}(x)$,
where $\mu$ and $\sigma^2$ are unknown, then $X - \mu$ is not a statistic; neither is $X/\sigma$ (since
they are not functions of the observable random variable $X$ only; they contain
unknown parameters), but $X$, $X + 3$, and $X^2 + \log X^2$ are statistics.

In the formulation above, one of the central problems in statistics is to
find a suitable statistic (function of the random variables $X_1, X_2, \ldots, X_n$) to
represent $\theta$.


EXAMPLE 3 If $X_1, \ldots, X_n$ is a random sample from a density $f(\cdot; \theta)$, then
(provided $X_1, \ldots, X_n$ are observable)

$$ \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i $$

is a statistic, and

$$ \tfrac{1}{2} \{ \min [X_1, \ldots, X_n] + \max [X_1, \ldots, X_n] \} $$

is also a statistic. If $f(x; \theta) = \phi_{\theta, 1}(x)$ and $\theta$ is unknown, $\bar{X}_n - \theta$ is not a
statistic since it depends on $\theta$, which is unknown. ////

Next we shall define and discuss some important statistics, the sample
moments.

Definition 6 Sample moments Let $X_1, X_2, \ldots, X_n$ be a random sample
from the density $f(\cdot)$. Then the $r$th sample moment about 0, denoted by
$M'_r$, is defined to be

$$ M'_r = \frac{1}{n} \sum_{i=1}^{n} X_i^r. \tag{3} $$

In particular, if $r = 1$, we get the sample mean, which is usually denoted by
$\bar{X}$ or $\bar{X}_n$; that is,

$$ \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i. \tag{4} $$

Also, the $r$th sample moment about $\bar{X}_n$, denoted by $M_r$, is defined to be

$$ M_r = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^r. \tag{5} $$

////


Remark Note that sample moments are examples of statistics. ////
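
Because a sample moment involves only the observed values, it can be computed directly once the sample is in hand. The sketch below evaluates Eqs. (3) to (5) for a small set of made-up observations; the data and the function names are hypothetical.

```python
def sample_moment_about_zero(xs, r):
    """M'_r = (1/n) * sum of x_i^r, as in Eq. (3)."""
    return sum(x**r for x in xs) / len(xs)


def sample_mean(xs):
    """The first sample moment about 0, as in Eq. (4)."""
    return sample_moment_about_zero(xs, 1)


def sample_moment_about_mean(xs, r):
    """M_r = (1/n) * sum of (x_i - xbar)^r, as in Eq. (5)."""
    xbar = sample_mean(xs)
    return sum((x - xbar)**r for x in xs) / len(xs)


if __name__ == "__main__":
    observed = [2.0, 4.0, 4.0, 5.0, 7.0]   # hypothetical observed values x_1, ..., x_n
    print("sample mean :", sample_mean(observed))
    print("M'_2        :", sample_moment_about_zero(observed, 2))
    print("M_2         :", sample_moment_about_mean(observed, 2))
```

None of these functions needs an unknown parameter as input, which is what qualifies each of them as a statistic.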


We will consider in detail some properties of the sample mean in Sec. 3 
below. 

In Chap. II we defined the $r$th moment of a random variable $X$, or the $r$th
moment of its corresponding density $f_X(\cdot)$, to be $\mathscr{E}[X^r] = \mu'_r$. We could say
that $\mathscr{E}[X^r]$ is the $r$th population moment of the population with density
$f(x) = f_X(x)$. We shall now show that the sample moments reflect the population
moments in the sense that the expected value of a sample moment (about 0)
equals the corresponding population moment. Also, the variance of a sample
moment will be shown to be $(1/n)$ times some function of population moments.
The implication is that for a given population the values that the sample moment
assumes will tend to be more concentrated about the corresponding population
moment for large sample size $n$ than for small sample size. Thus a sample
moment can be used to estimate its corresponding population moment (provided
the population moment exists).


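Theorem 1 Let $X_1, X_2, \ldots, X_n$ be a random sample from a population
with a density $f(\cdot)$. The expected value of the $r$th sample moment (about
0) is equal to the $r$th population moment; that is,

$$ \mathscr{E}[M'_r] = \mu'_r \quad \text{(if } \mu'_r \text{ exists)}. \tag{6} $$

Also,

$$ \mathrm{var}[M'_r] = \frac{1}{n} \{ \mathscr{E}[X^{2r}] - (\mathscr{E}[X^r])^2 \} = \frac{1}{n} [ \mu'_{2r} - (\mu'_r)^2 ] \quad \text{(if } \mu'_{2r} \text{ exists)}. \tag{7} $$

proof

$$ \mathscr{E}[M'_r] = \mathscr{E}\Big[ \frac{1}{n} \sum_{i=1}^{n} X_i^r \Big] = \frac{1}{n} \sum_{i=1}^{n} \mathscr{E}[X_i^r] = \frac{1}{n} \sum_{i=1}^{n} \mu'_r = \mu'_r. $$

$$ \mathrm{var}[M'_r] = \mathrm{var}\Big[ \frac{1}{n} \sum_{i=1}^{n} X_i^r \Big] = \Big( \frac{1}{n} \Big)^2 \sum_{i=1}^{n} \mathrm{var}[X_i^r] $$

$$ = \Big( \frac{1}{n} \Big)^2 \sum_{i=1}^{n} \{ \mathscr{E}[X_i^{2r}] - (\mathscr{E}[X_i^r])^2 \} = \frac{1}{n} [ \mu'_{2r} - (\mu'_r)^2 ]. $$

The second equality in the variance computation uses the independence of
$X_1, \ldots, X_n$. ////

In particular, if $r = 1$, we get the following corollary.

Corollary Let $X_1, X_2, \ldots, X_n$ be a random sample from a density $f(\cdot)$,
and let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ be the sample mean; then

$$ \mathscr{E}[\bar{X}_n] = \mu \quad \text{and} \quad \mathrm{var}[\bar{X}_n] = \frac{1}{n} \sigma^2, \tag{8} $$

where $\mu$ and $\sigma^2$ are, respectively, the mean and variance of $f(\cdot)$. ////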
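
Theorem 1 and its corollary can be checked exactly for a small discrete population by listing every possible ordered sample. The sketch below assumes, purely for illustration, that the population is the outcome of one roll of a fair die; the function names are our own, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction
from itertools import product

FACES = [Fraction(k) for k in range(1, 7)]   # hypothetical population: one fair die


def population_moment(r):
    """mu'_r = E[X^r] for a single roll."""
    return sum(x**r for x in FACES) / len(FACES)


def sample_moment(xs, r):
    """M'_r = (1/n) * sum of x_i^r for one ordered sample xs."""
    return sum(x**r for x in xs) / len(xs)


def exact_mean_and_variance(n, r):
    """Mean and variance of M'_r over all equally likely ordered samples of size n."""
    values = [sample_moment(xs, r) for xs in product(FACES, repeat=n)]
    mean = sum(values) / len(values)
    variance = sum((v - mean)**2 for v in values) / len(values)
    return mean, variance


if __name__ == "__main__":
    n, r = 2, 2
    mean, variance = exact_mean_and_variance(n, r)
    print("E[M'_r]   =", mean, "   mu'_r =", population_moment(r))
    print("var[M'_r] =", variance, "   (1/n)[mu'_2r - (mu'_r)^2] =",
          (population_moment(2 * r) - population_moment(r)**2) / n)
```

With r = 1 the same enumeration reproduces Eq. (8): the mean of the sample mean is 7/2 and its variance is the population variance 35/12 divided by n.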

As we mentioned earlier, properties of the sample mean will be studied
in detail in the next section.

Theorem 1 gives the mean and variance in terms of population moments of
the $r$th sample moment about 0; a similar, though more complicated, result
can be derived for the mean and variance of the $r$th sample moment about the
sample mean. We will be content with looking only at the particular case
$r = 2$, that is, at $M_2 = (1/n) \sum (X_i - \bar{X})^2$. $M_2$ is sometimes called the sample
variance, although we will take Definition 7 as our definition of the sample
variance.

Definition 7 Sample variance Let $X_1, X_2, \ldots, X_n$ be a random sample
from a density $f(\cdot)$; then

$$ S_n^2 = S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \quad \text{for } n > 1 \tag{9} $$

is defined to be the sample variance. ////


The reason for taking $S^2$ rather than $M_2$ as our definition of the sample
variance (both measure dispersion in the sample) is that the expected value of
$S^2$ equals the population variance.

The proof of the following remark is left as an exercise.


Remark

$$ S_n^2 = S^2 = \frac{1}{2n(n-1)} \sum_{i=1}^{n} \sum_{j=1}^{n} (X_i - X_j)^2. $$

////
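
The sample variance of Eq. (9), the pairwise-difference form of it, and the second sample moment about the mean can be compared on a small data set. The sketch below uses made-up observations and exact fractions; the pairwise form coded here, a double sum of squared differences divided by 2n(n - 1), is a standard identity, and the function names are our own.

```python
from fractions import Fraction


def sample_variance(xs):
    """S^2 = (1/(n-1)) * sum of (x_i - xbar)^2, as in Eq. (9); requires n > 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar)**2 for x in xs) / (n - 1)


def sample_variance_pairwise(xs):
    """S^2 written as (1/(2n(n-1))) * double sum of (x_i - x_j)^2."""
    n = len(xs)
    total = sum((xi - xj)**2 for xi in xs for xj in xs)
    return total / (2 * n * (n - 1))


def second_moment_about_mean(xs):
    """M_2 = (1/n) * sum of (x_i - xbar)^2."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar)**2 for x in xs) / n


if __name__ == "__main__":
    observed = [Fraction(v) for v in (2, 4, 4, 5, 7)]   # hypothetical observations
    print("S^2 (Eq. 9)    :", sample_variance(observed))
    print("S^2 (pairwise) :", sample_variance_pairwise(observed))
    print("M_2            :", second_moment_about_mean(observed))
```

The two forms of S^2 agree, and M_2 differs from S^2 only by the factor (n - 1)/n.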


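Theorem 2 Let $X_1, X_2, \ldots, X_n$ be a random sample from a density
$f(\cdot)$, and let

$$ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2. $$

Then

$$ \mathscr{E}[S^2] = \sigma^2 \quad \text{and} \quad \mathrm{var}[S^2] = \frac{1}{n} \Big( \mu_4 - \frac{n-3}{n-1} \sigma^4 \Big) \quad \text{for } n > 1. \tag{10} $$

////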


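proof (Only the first part will be proved.) Recall that
$\sigma^2 = \mathscr{E}[(X - \mu)^2]$ and $\mu_r = \mathscr{E}[(X - \mu)^r]$. We commence by noting and
proving an identity that is quite useful:

$$ \sum_{i=1}^{n} (X_i - \mu)^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 + n(\bar{X} - \mu)^2, \tag{11} $$

since

$$ \sum (X_i - \mu)^2 = \sum (X_i - \bar{X} + \bar{X} - \mu)^2 = \sum [(X_i - \bar{X}) + (\bar{X} - \mu)]^2 $$

$$ = \sum [(X_i - \bar{X})^2 + 2(X_i - \bar{X})(\bar{X} - \mu) + (\bar{X} - \mu)^2] $$

$$ = \sum (X_i - \bar{X})^2 + 2(\bar{X} - \mu) \sum (X_i - \bar{X}) + n(\bar{X} - \mu)^2 $$

$$ = \sum (X_i - \bar{X})^2 + n(\bar{X} - \mu)^2, $$

where the last step uses $\sum (X_i - \bar{X}) = 0$.

Using the identity of Eq. (11), we obtain

$$ \mathscr{E}[S^2] = \mathscr{E}\Big[ \frac{1}{n-1} \sum (X_i - \bar{X})^2 \Big] = \frac{1}{n-1} \mathscr{E}\Big[ \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2 \Big] $$

$$ = \frac{1}{n-1} \Big\{ \sum_{i=1}^{n} \mathscr{E}[(X_i - \mu)^2] - n \mathscr{E}[(\bar{X} - \mu)^2] \Big\} $$

$$ = \frac{1}{n-1} \Big\{ \sum_{i=1}^{n} \sigma^2 - n \, \mathrm{var}[\bar{X}] \Big\} = \frac{1}{n-1} \Big( n\sigma^2 - n \frac{\sigma^2}{n} \Big) = \sigma^2. $$

Although the derivation of the formula for the variance of $S^2$ can
be accomplished by utilizing the above identity [Eq. (11)] and

$$ \bar{X} - \mu = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu), $$

such derivation is lengthy and is omitted here, only to be relegated to
the Problems. ////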
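
Both parts of Theorem 2 can be checked exactly on a small discrete population by enumerating all ordered samples. As before, the fair-die population is a hypothetical choice made for illustration, and the function names are our own; the sketch also shows that M_2 has expectation (n - 1)/n times the population variance, which is the reason given above for preferring S^2.

```python
from fractions import Fraction
from itertools import product

FACES = [Fraction(k) for k in range(1, 7)]   # hypothetical population: one fair die
MU = sum(FACES) / len(FACES)                 # population mean, 7/2


def central_moment(r):
    """mu_r = E[(X - mu)^r] for a single roll."""
    return sum((x - MU)**r for x in FACES) / len(FACES)


def s_squared(xs):
    """S^2 with divisor n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar)**2 for x in xs) / (n - 1)


def m_two(xs):
    """M_2 with divisor n."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar)**2 for x in xs) / n


def exact_expectation(statistic, n):
    """Average of the statistic over all equally likely ordered samples of size n."""
    values = [statistic(xs) for xs in product(FACES, repeat=n)]
    return sum(values) / len(values)


def exact_variance(statistic, n):
    """Variance of the statistic over all equally likely ordered samples of size n."""
    values = [statistic(xs) for xs in product(FACES, repeat=n)]
    mean = sum(values) / len(values)
    return sum((v - mean)**2 for v in values) / len(values)


if __name__ == "__main__":
    n = 3
    sigma2 = central_moment(2)
    print("sigma^2  =", sigma2)
    print("E[S^2]   =", exact_expectation(s_squared, n))
    print("E[M_2]   =", exact_expectation(m_two, n))
    print("var[S^2] =", exact_variance(s_squared, n),
          "  formula:", (central_moment(4) - Fraction(n - 3, n - 1) * sigma2**2) / n)
```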

Sample moments are examples of statistics that can be used to estimate
their population counterparts; for example, $M'_r$ estimates $\mu'_r$, $\bar{X}$ estimates $\mu$,
and $S^2$ estimates $\sigma^2$. In each case, we are taking some function of the sample,
which we can observe, and using the value of that function of the sample to
estimate the unknown population parameter.


3 SAMPLE MEAN 

The first sample moment is the sample mean, defined to be

$$ \bar{X} = \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i, $$

where $X_1, X_2, \ldots, X_n$ is a random sample from a density $f(\cdot)$. $\bar{X}$ is a function
of the random variables $X_1, \ldots, X_n$, and hence theoretically the distribution of
$\bar{X}$ can be found. In general, one would suspect that the distribution of $\bar{X}$
depends on the density $f(\cdot)$ from which the random sample was selected, and
indeed it does. Two characteristics of the distribution of $\bar{X}$, its mean and
variance, do not depend on the density $f(\cdot)$ per se but depend only on two
characteristics of the density $f(\cdot)$, namely, its mean $\mu$ and variance $\sigma^2$. This
idea is reviewed in the following subsection, while succeeding subsections
consider other results involving the sample mean. The exact distribution of
$\bar{X}$ will be given for certain specific densities $f(\cdot)$.

It might be helpful in reading this section to think of the sample mean $\bar{X}$
as an estimate of the mean $\mu$ of the density $f(\cdot)$ from which the sample was
selected. We might think that one purpose in taking the sample is to estimate
$\mu$ with $\bar{X}$.


3.1 Mean and Variance 

Theorem 3 Let $X_1, X_2, \ldots, X_n$ be a random sample from a density $f(\cdot)$,
which has mean $\mu$ and finite variance $\sigma^2$, and let $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$. Then

$$ \mathscr{E}[\bar{X}] = \mu_{\bar{X}} = \mu \quad \text{and} \quad \mathrm{var}[\bar{X}] = \sigma^2_{\bar{X}} = \frac{1}{n} \sigma^2. \tag{12} $$

////

Theorem 3 is just a restatement of the corollary of Theorem 1. In light
of using a value of $\bar{X}$ to estimate $\mu$, let us note what Theorem 3 says. $\mathscr{E}[\bar{X}] = \mu$
says that on the average $\bar{X}$ is equal to the parameter $\mu$ being estimated or that
the distribution of $\bar{X}$ is centered about $\mu$. $\mathrm{var}[\bar{X}] = (1/n)\sigma^2$ says that the
spread of the values of $\bar{X}$ about $\mu$ is small for a large sample size as compared to
a small sample size. For instance, the variance of the distribution of $\bar{X}$ for a
sample of size 20 is one-half the variance of the distribution of $\bar{X}$ for a sample of
size 10. So for a large sample size the values of $\bar{X}$ (which are used to estimate
$\mu$) tend to be more concentrated about $\mu$ than for a small sample size. This
notion is further exemplified by the law of large numbers considered in the next
subsection.
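
The halving of the variance when the sample size goes from 10 to 20 can be seen in a small simulation. The sketch below assumes a uniform population on (0, 1), whose variance is 1/12; that choice, the number of replications, and the function name are ours and are meant only as an illustration.

```python
import random
from statistics import fmean, pvariance


def sample_mean_variance(n, replications, rng):
    """Empirical variance of the sample mean of n uniform(0, 1) draws."""
    means = [fmean([rng.random() for _ in range(n)]) for _ in range(replications)]
    return pvariance(means)


if __name__ == "__main__":
    rng = random.Random(2024)        # fixed seed so that the run is repeatable
    v10 = sample_mean_variance(10, 5000, rng)
    v20 = sample_mean_variance(20, 5000, rng)
    print(f"n = 10: simulated var = {v10:.5f}, theory = {(1 / 12) / 10:.5f}")
    print(f"n = 20: simulated var = {v20:.5f}, theory = {(1 / 12) / 20:.5f}")
    print(f"ratio (n = 20 over n = 10) = {v20 / v10:.3f}")
```

The simulated ratio should fall near one-half, up to simulation error.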


3.2 Law of Large Numbers 

Let $f(\cdot; \theta)$ be the density of a random variable $X$. We have discussed the fact
that one way to get some information about the density function $f(\cdot; \theta)$ is to
observe a random sample and make an inference from the sample to the
population. If $\theta$ were known, the density function would be completely specified,
and no inference from the sample to the population would be necessary.
Therefore, it seems that we would like to have the random sample tell us something
about the unknown parameter $\theta$. This problem will be discussed in detail in
the next chapter. In this subsection we shall discuss a related particular
problem.

Let $\mathscr{E}[X]$ be denoted by $\mu$ in the density $f(\cdot)$. The problem is to estimate
$\mu$. In a loose sense, $\mathscr{E}[X]$ is the average of an infinite number of values of the
random variable $X$. In any real-world problem we can observe only a finite
number of values of the random variable $X$. A very crucial question then is:
Using only a finite number of values of $X$ (a random sample of size $n$, say), can
any reliable inferences be made about $\mathscr{E}[X]$, the average of an infinite number of
values of $X$? The answer is "yes"; reliable inferences about $\mathscr{E}[X]$ can be made
by using only a finite sample, and we shall demonstrate this by proving what is
called the weak law of large numbers. In words, the law states the following: A
positive integer $n$ can be determined such that if a random sample of size $n$ or
larger is taken from a population with the density $f(\cdot)$ (with $\mathscr{E}[X] = \mu$), the
probability can be made to be as close to 1 as desired that the sample mean $\bar{X}$
will deviate from $\mu$ by less than any arbitrarily specified small quantity. More
precisely, the weak law of large numbers states that for any two chosen small
numbers $\epsilon$ and $\delta$, where $\epsilon > 0$ and $0 < \delta < 1$, there exists an integer $n$ such that
if a random sample of size $n$ or larger is obtained from $f(\cdot)$ and the sample
mean, denoted by $\bar{X}_n$, computed, then the probability is greater than $1 - \delta$
(i.e., as close to 1 as desired) that $\bar{X}_n$ deviates from $\mu$ by less than $\epsilon$ (i.e., is
arbitrarily close to $\mu$). In symbols this is written: For any $\epsilon > 0$ and $0 < \delta < 1$
there exists an integer $n$ such that for all integers $m \ge n$

$$ P[\, |\bar{X}_m - \mu| < \epsilon \,] \ge 1 - \delta. $$

The weak law of large numbers is proved using the Chebyshev inequality given
in Chap. II.

Theorem 4 Weak law of large numbers Let $f(\cdot)$ be a density with mean
$\mu$ and finite variance $\sigma^2$, and let $\bar{X}_n$ be the sample mean of a random
sample of size $n$ from $f(\cdot)$. Let $\epsilon$ and $\delta$ be any two specified numbers
satisfying $\epsilon > 0$ and $0 < \delta < 1$. If $n$ is any integer greater than $\sigma^2/\epsilon^2\delta$,
then

$$ P[-\epsilon < \bar{X}_n - \mu < \epsilon] \ge 1 - \delta. \tag{13} $$

proof Theorem 5 in Subsec. 4.4 of Chap. II stated that $P[g(X) \ge k]
\le \mathscr{E}[g(X)]/k$ for every $k > 0$, random variable $X$, and nonnegative
function $g(\cdot)$. Equivalently, $P[g(X) < k] \ge 1 - \mathscr{E}[g(X)]/k$.
Let $g(X) = (\bar{X}_n - \mu)^2$ and $k = \epsilon^2$; then

$$ P[-\epsilon < \bar{X}_n - \mu < \epsilon] = P[\, |\bar{X}_n - \mu| < \epsilon \,] = P[\, |\bar{X}_n - \mu|^2 < \epsilon^2 \,] $$

$$ \ge 1 - \frac{\mathscr{E}[(\bar{X}_n - \mu)^2]}{\epsilon^2} = 1 - \frac{(1/n)\sigma^2}{\epsilon^2} \ge 1 - \delta $$

for $\delta > \sigma^2/n\epsilon^2$ or $n > \sigma^2/\epsilon^2\delta$. ////


Below are two examples to illustrate how the weak law of large numbers
can be used.


EXAMPLE 4 Suppose that some distribution with an unknown mean has
variance equal to 1. How large a sample must be taken in order that the
probability will be at least .95 that the sample mean $\bar{X}_n$ will lie within .5
of the population mean? We have $\sigma^2 = 1$, $\epsilon = .5$, and $\delta = .05$; therefore

$$ n > \frac{\sigma^2}{\delta \epsilon^2} = \frac{1}{.05(.5)^2} = 80. $$

////


EXAMPLE 5 How large a sample must be taken in order that you are 99
percent certain that $\bar{X}_n$ is within $.5\sigma$ of $\mu$? We have $\epsilon = .5\sigma$ and $\delta = .01$.
Thus

$$ n > \frac{\sigma^2}{\delta \epsilon^2} = \frac{\sigma^2}{.01(.5\sigma)^2} = \frac{1}{.01(.5)^2} = 400. $$

////

We have shown that by use of a random sample inductive inferences to 
populations can be made and the reliability of the inferences can be measured in 
terms of probability. For instance, in Example 4 above, the probability that 
the sample mean will be within one-half unit of the unknown population mean 
is at least .95 if a sample of size greater than 80 is taken. 
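
The sample-size calculation used in Examples 4 and 5 is a one-line formula, and it can be sketched as follows. The function names are our own; decimal strings are converted to exact fractions so that the bound is not disturbed by rounding, and in Example 5 the variance is set to 1 because it cancels from the ratio.

```python
import math
from fractions import Fraction


def chebyshev_bound(variance, eps, delta):
    """Return sigma^2 / (delta * eps^2); Theorem 4 applies to any integer n above it."""
    variance, eps, delta = Fraction(variance), Fraction(eps), Fraction(delta)
    if eps <= 0 or not 0 < delta < 1:
        raise ValueError("need eps > 0 and 0 < delta < 1")
    return variance / (delta * eps**2)


def smallest_sufficient_n(variance, eps, delta):
    """Smallest integer strictly greater than the bound."""
    return math.floor(chebyshev_bound(variance, eps, delta)) + 1


if __name__ == "__main__":
    print("Example 4 bound:", chebyshev_bound("1", "0.5", "0.05"))
    print("Example 5 bound:", chebyshev_bound("1", "0.5", "0.01"))   # sigma taken as 1
    print("Example 4, smallest n:", smallest_sufficient_n("1", "0.5", "0.05"))
```

The bounds agree with the values 80 and 400 found in the examples. Since the bound comes from the Chebyshev inequality, which holds for every density with finite variance, it is typically conservative for any particular density.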

3.3 Central-Limit Theorem

Although we have already stated the central-limit theorem in our study of
distribution theory in Chap. V, we will repeat it here in our study of the sample
mean $\bar{X}$ because it gives the asymptotic distribution of $\bar{X}$. At the outset of this
section we indicated that we were interested in the distribution of $\bar{X}$. The
central-limit theorem, which is one of the most important theorems in all of
probability and statistics, tells us approximately how $\bar{X}$ is distributed.

Theorem 5 Central-limit theorem Let $f(\cdot)$ be a density with mean $\mu$ and
finite variance $\sigma^2$. Let $\bar{X}_n$ be the sample mean of a random sample of
size $n$ from $f(\cdot)$. Let the random variable $Z_n$ be defined by

$$ Z_n = \frac{\bar{X}_n - \mathscr{E}[\bar{X}_n]}{\sqrt{\mathrm{var}[\bar{X}_n]}} = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}. \tag{14} $$

Then, the distribution of $Z_n$ approaches the standard normal distribution
as $n$ approaches infinity. ////


Theorem 5 tells us that the limiting distribution of $Z_n$ (which is $\bar{X}_n$
standardized) is a standard normal distribution, or it tells us that $\bar{X}_n$ itself is
approximately or asymptotically distributed as a normal distribution with mean
$\mu$ and variance $\sigma^2/n$.

The astonishing thing about Theorem 5 is the fact that nothing is said
about the form of the original density function. Whatever the distribution
function, provided only that it has a finite variance, the sample mean will have
approximately the normal distribution for large samples. The condition that
the variance be finite is not a critical restriction so far as applied statistics is
concerned because in almost any practical situation the range of the random
variable will be finite, in which case the variance must necessarily be finite.

The importance of Theorem 5, as far as practical applications are
concerned, is the fact that the mean $\bar{X}_n$ of a random sample from any distribution
with finite variance $\sigma^2$ and mean $\mu$ is approximately distributed as a normal
random variable with mean $\mu$ and variance $\sigma^2/n$.

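We shall not be able to prove Theorem 5 because it requires rather
advanced mathematical techniques. However, in order to make the theorem
plausible, we shall outline a proof for the more restricted situation in which
the distribution has a moment generating function. The argument will be
essentially a matter of showing that the moment generating function for the
sample mean approaches the moment generating function for the normal
distribution.

Recall that the moment generating function of a standard normal
distribution is given by $e^{t^2/2}$. (See Subsec. 3.2 of Chap. III.) Let $m(t) = e^{t^2/2}$.
Let $m_{Z_n}(t)$ denote the moment generating function of $Z_n$. It is our purpose to
show that $m_{Z_n}(t)$ must approach $m(t)$ when $n$, the sample size, becomes large.
Now

$$ m_{Z_n}(t) = \mathscr{E}[e^{tZ_n}] = \mathscr{E}\Big[ \exp\Big( t \, \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \Big) \Big] = \mathscr{E}\Big[ \exp\Big( \frac{t}{n} \sum_{i=1}^{n} \frac{X_i - \mu}{\sigma/\sqrt{n}} \Big) \Big] $$

$$ = \mathscr{E}\Big[ \prod_{i=1}^{n} \exp\Big( \frac{t}{\sqrt{n}} \, \frac{X_i - \mu}{\sigma} \Big) \Big] = \prod_{i=1}^{n} \mathscr{E}\Big[ \exp\Big( \frac{t}{\sqrt{n}} \, \frac{X_i - \mu}{\sigma} \Big) \Big], $$

using the independence of $X_1, \ldots, X_n$. Now if we let $Y_i = (X_i - \mu)/\sigma$, then
$m_{Y_i}(t)$, the moment generating function of $Y_i$, is independent of $i$ since all $Y_i$
have the same distribution. Let $m_Y(t)$ denote $m_{Y_i}(t)$; then

$$ m_{Z_n}(t) = \prod_{i=1}^{n} \mathscr{E}\Big[ \exp\Big( \frac{t}{\sqrt{n}} Y_i \Big) \Big] = \prod_{i=1}^{n} m_{Y_i}\Big( \frac{t}{\sqrt{n}} \Big) = \Big[ m_Y\Big( \frac{t}{\sqrt{n}} \Big) \Big]^n. $$

The $r$th derivative of $m_Y(t/\sqrt{n})$ evaluated at $t = 0$ gives us the $r$th moment about
the mean of the density $f(\cdot)$ divided by $(\sigma\sqrt{n})^r$, so we may write

$$ m_Y\Big( \frac{t}{\sqrt{n}} \Big) = 1 + \frac{\mu_1}{\sigma} \frac{t}{\sqrt{n}} + \frac{1}{2!} \frac{\mu_2}{\sigma^2} \Big( \frac{t}{\sqrt{n}} \Big)^2 + \frac{1}{3!} \frac{\mu_3}{\sigma^3} \Big( \frac{t}{\sqrt{n}} \Big)^3 + \cdots, $$

and since $\mu_1 = 0$ and $\mu_2 = \sigma^2$, this may be written

$$ m_Y\Big( \frac{t}{\sqrt{n}} \Big) = 1 + \frac{1}{n} \Big( \frac{t^2}{2} + \frac{1}{3!\sqrt{n}} \frac{\mu_3}{\sigma^3} t^3 + \frac{1}{4!\,n} \frac{\mu_4}{\sigma^4} t^4 + \cdots \Big). \tag{17} $$

Now $\lim_{n \to \infty} (1 + u/n)^n = e^{t^2/2}$, where $u$ represents the expression within the
parentheses in Eq. (17). We have $\lim_{n \to \infty} m_{Z_n}(t) = e^{t^2/2}$, so that in the limit $Z_n$
has the same moment generating function as a standard normal and, by a theorem
similar to Theorem 7 in Chap. II, has the same distribution.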

The degree of approximation depends, of course, on the sample size and
on the particular density $f(\cdot)$. The approach to normality is illustrated in
Fig. 2 for the particular function defined by $f(x) = e^{-x} I_{(0, \infty)}(x)$. The solid
curves give the actual distributions, while the dashed curves give the normal
approximations. Figure 2a gives the original distribution, which corresponds to
samples of 1; Fig. 2b shows the distribution of sample means for $n = 3$; Fig. 2c
gives the distribution of sample means for $n = 10$. The curves rather exaggerate
the approach to normality because they cannot show what happens on the tails
of the distribution. Ordinarily distributions of sample means approach
normality fairly rapidly with the sample size in the region of the mean, but more
slowly at points distant from the mean; usually the greater the distance of a
point from the mean, the more slowly the normal approximation approaches the
actual distribution.
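
The approach to normality for the density $f(x) = e^{-x} I_{(0, \infty)}(x)$ can also be examined by simulation. The sketch below is an illustration only: it draws repeated samples, standardizes each sample mean as in Eq. (14) using the fact that this density has mean 1 and variance 1, and compares the simulated distribution function of $Z_n$ with the standard normal one at a few points. The replication count, the seed, and the function names are our own choices.

```python
import math
import random
from statistics import NormalDist, fmean


def standardized_mean(n, rng):
    """Z_n = (xbar - mu) / (sigma / sqrt(n)) for n draws from f(x) = e^(-x), x > 0."""
    xbar = fmean([rng.expovariate(1.0) for _ in range(n)])
    return (xbar - 1.0) / (1.0 / math.sqrt(n))   # this density has mu = sigma = 1


def empirical_cdf_at(z, n, replications, rng):
    """Fraction of simulated Z_n values that are less than or equal to z."""
    hits = sum(1 for _ in range(replications) if standardized_mean(n, rng) <= z)
    return hits / replications


if __name__ == "__main__":
    rng = random.Random(7)                       # fixed seed so that the run is repeatable
    points = (-1.0, 0.0, 1.0)
    for n in (1, 3, 10):
        row = ", ".join(f"{empirical_cdf_at(z, n, 2000, rng):.3f}" for z in points)
        print(f"n = {n:2d}: simulated P[Z_n <= z] at z = -1, 0, 1: {row}")
    normal_row = ", ".join(f"{NormalDist().cdf(z):.3f}" for z in points)
    print(f"standard normal values at z = -1, 0, 1:        {normal_row}")
```

For n = 1 the simulated values differ visibly from the normal ones, reflecting the skewness of the exponential density; the discrepancy shrinks as n grows.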

In the following subsections we will give the exact distribution of the
sample mean for some specific densities $f(\cdot)$.

3.4 Bernoulli and Poisson Distributions

If $X_1, X_2, \ldots, X_n$ is a random sample from a Bernoulli distribution, we can
find the exact distribution of $\bar{X}_n$. (We know that $\bar{X}_n$ is approximately normally
distributed.) The density from which we are sampling is

$$ f(x) = p^x (1 - p)^{1-x} I_{\{0,1\}}(x). $$

We know (see Example 9 of Chap. V) that $\sum X_i$ has a binomial distribution;
that is,

$$ P\Big[ \sum_{i=1}^{n} X_i = k \Big] = \binom{n}{k} p^k q^{n-k} \quad \text{for } k = 0, 1, \ldots, n; $$

hence the distribution of $\bar{X}_n$ is given by

$$ P\Big[ \bar{X}_n = \frac{k}{n} \Big] = \binom{n}{k} p^k q^{n-k} \quad \text{for } k = 0, 1, \ldots, n. \tag{18} $$

So $\bar{X}_n$, the sample mean of a random sample from a Bernoulli density, takes on
the values $0, 1/n, 2/n, \ldots, 1$ with respective binomial probabilities

$$ \binom{n}{0} p^0 q^n, \ \binom{n}{1} p^1 q^{n-1}, \ \binom{n}{2} p^2 q^{n-2}, \ \ldots, \ \binom{n}{n} p^n q^0. $$

If $X_1, \ldots, X_n$ is a random sample from a Poisson distribution with mean
$\lambda$, then $\sum X_i$ also has a Poisson distribution with parameter $n\lambda$ (see Example 10
of Chap. V), and hence

$$ P\Big[ \bar{X}_n = \frac{k}{n} \Big] = P\Big[ \sum_{i=1}^{n} X_i = k \Big] = \frac{e^{-n\lambda} (n\lambda)^k}{k!} \quad \text{for } k = 0, 1, 2, \ldots, \tag{19} $$

which gives the exact distribution of the sample mean for a sample from a
Poisson density.
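
Equations (18) and (19) are easy to evaluate numerically. The following sketch tabulates both exact distributions; the parameter values (n = 10, p = 0.3, n = 5, mean 2) and the function names are hypothetical choices made for illustration.

```python
from math import comb, exp, factorial


def bernoulli_mean_pmf(k, n, p):
    """Eq. (18): P[Xbar_n = k/n] for a Bernoulli(p) random sample of size n."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def poisson_mean_pmf(k, n, lam):
    """Eq. (19): P[Xbar_n = k/n] for a Poisson random sample of size n with mean lam."""
    return exp(-n * lam) * (n * lam)**k / factorial(k)


if __name__ == "__main__":
    n, p = 10, 0.3                   # hypothetical Bernoulli case
    for k in range(n + 1):
        print(f"P[Xbar = {k / n:.1f}] = {bernoulli_mean_pmf(k, n, p):.4f}")
    n, lam = 5, 2.0                  # hypothetical Poisson case
    for k in range(6):
        print(f"P[Xbar = {k / n:.1f}] = {poisson_mean_pmf(k, n, lam):.5f}")
```

In both cases the probabilities sum to 1 over the possible values k/n, and the mean of the tabulated distribution equals the population mean, as Theorem 3 requires.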


3.5 Exponential Distribution 

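Let $X_1, X_2, \ldots, X_n$ be a random sample from the exponential density

$$ f(x) = \theta e^{-\theta x} I_{(0, \infty)}(x). $$

According to Example 11 of Chap. V, $\sum_{i=1}^{n} X_i$ has a gamma distribution with
parameters $n$ and $\theta$; that is,

$$ f_{\sum X_i}(z) = \frac{1}{\Gamma(n)} z^{n-1} \theta^n e^{-\theta z} I_{(0, \infty)}(z), $$

or

$$ P\Big[ \sum X_i \le y \Big] = \int_0^y \frac{1}{\Gamma(n)} z^{n-1} \theta^n e^{-\theta z} \, dz \quad \text{for } y > 0, $$

and so

$$ P\Big[ \bar{X}_n \le \frac{y}{n} \Big] = \int_0^y \frac{1}{\Gamma(n)} z^{n-1} \theta^n e^{-\theta z} \, dz \quad \text{for } y > 0. $$

Or,

$$ P[\bar{X}_n \le x] = \int_0^{nx} \frac{1}{\Gamma(n)} z^{n-1} \theta^n e^{-\theta z} \, dz = \int_0^x \frac{1}{\Gamma(n)} (n\theta)^n u^{n-1} e^{-n\theta u} \, du \quad \text{for } x > 0; $$

that is, $\bar{X}_n$ has a gamma distribution with parameters $n$ and $n\theta$.

3.6 Uniform Distribution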

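Let $X_1, \ldots, X_n$ be a random sample from a uniform distribution on the interval
$(0, 1]$. The exact density of $\bar{X}_n$ is given by

$$ f_{\bar{X}_n}(x) = \sum_{k=0}^{n-1} \frac{n}{(n-1)!} \Big[ (nx)^{n-1} - \binom{n}{1} (nx - 1)^{n-1} + \binom{n}{2} (nx - 2)^{n-1} - \cdots + (-1)^k \binom{n}{k} (nx - k)^{n-1} \Big] I_{(k/n, (k+1)/n]}(x). \tag{20} $$

The derivation of the above (using mathematical induction and the convolution
formula) is rather tedious and is omitted. Instead let us look at the particular
cases $n = 1, 2, 3$:

$$ f_{\bar{X}_1}(x) = f_X(x) = I_{(0, 1]}(x), $$

$$ f_{\bar{X}_2}(x) = 2(2x) I_{(0, 1/2]}(x) + 2[2x - 2(2x - 1)] I_{(1/2, 1]}(x) = \begin{cases} 4x & \text{for } 0 < x \le \tfrac{1}{2} \\ 4(1 - x) & \text{for } \tfrac{1}{2} < x \le 1, \end{cases} $$

and

$$ f_{\bar{X}_3}(x) = \frac{3}{2} (3x)^2 I_{(0, 1/3]}(x) + \frac{3}{2} \Big[ (3x)^2 - \binom{3}{1} (3x - 1)^2 \Big] I_{(1/3, 2/3]}(x) + \frac{3}{2} \Big[ (3x)^2 - \binom{3}{1} (3x - 1)^2 + \binom{3}{2} (3x - 2)^2 \Big] I_{(2/3, 1]}(x) $$

$$ = \begin{cases} \tfrac{27}{2} x^2 & \text{for } 0 < x \le \tfrac{1}{3} \\ 27 \big[ \tfrac{1}{12} - (x - \tfrac{1}{2})^2 \big] & \text{for } \tfrac{1}{3} < x \le \tfrac{2}{3} \\ \tfrac{27}{2} (1 - x)^2 & \text{for } \tfrac{2}{3} < x \le 1. \end{cases} $$

$f_{\bar{X}_1}(x)$, $f_{\bar{X}_2}(x)$, and $f_{\bar{X}_3}(x)$ are sketched in Fig. 3, and an approach to normality
can be observed. (In fact, the inflection points of $f_{\bar{X}_3}(x)$ and of the normal
approximation occur at the same points!)

We have given the distribution of the sample mean from a uniform
distribution on the interval $(0, 1]$; the distribution of the sample mean from a uniform
distribution over an arbitrary interval $(a, b]$ can be found by transformation.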
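
The general formula for the density of the mean of n uniform variables can be evaluated directly and compared with the particular cases worked out above. The sketch below codes the alternating-sum form, namely n/(n - 1)! times the sum over j from 0 to k of (-1)^j C(n, j)(nx - j)^(n-1) on the interval (k/n, (k+1)/n]; the function name is our own, and the numerical integration at the end is only a rough check that the density integrates to 1.

```python
from math import ceil, comb, factorial


def uniform_mean_density(x, n):
    """Density at x of the mean of n independent uniform(0, 1] random variables."""
    if not 0 < x <= 1:
        return 0.0
    k = ceil(n * x) - 1              # x lies in the interval (k/n, (k+1)/n]
    terms = sum((-1)**j * comb(n, j) * (n * x - j)**(n - 1) for j in range(k + 1))
    return n / factorial(n - 1) * terms


if __name__ == "__main__":
    print("n = 2, x = 0.25:", uniform_mean_density(0.25, 2), " (4x)")
    print("n = 2, x = 0.75:", uniform_mean_density(0.75, 2), " (4(1 - x))")
    print("n = 3, x = 0.50:", uniform_mean_density(0.50, 3), " (27/12)")
    # Midpoint-rule check that the density for n = 5 integrates to about 1.
    steps = 2000
    area = sum(uniform_mean_density((i + 0.5) / steps, 5) for i in range(steps)) / steps
    print("n = 5, approximate total area:", round(area, 4))
```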

3.7 Cauchy Distribution

Let $X_1, \ldots, X_n$ be a random sample from the Cauchy density

$$ f(x) = \frac{1}{\pi\beta \{ 1 + [(x - \alpha)/\beta]^2 \}}; $$

then $\bar{X}_n$ has this same Cauchy distribution for any $n$. That is, the sample mean
has the same distribution as one of its components. We are unable to easily
verify this result. The moment-generating-function technique fails us since the
moment generating function of a Cauchy distribution does not exist.
Mathematical induction in conjunction with the convolution formula produces
integrations that are apt to be difficult for a nonadvanced calculus student to
perform. The result, however, is easily obtained using complex-variable
analysis. In fact, if we had defined the characteristic function of a random
variable, which is a generalization of a moment generating function, then the
above result would follow immediately from the fact that the product of the
characteristic functions of independent and identically distributed random
variables is the characteristic function of their sum. A major advantage of
characteristic functions over moment generating functions is that they always
exist.





4 SAMPLING FROM THE NORMAL DISTRIBUTION


4.1 Role of the Normal Distribution in Statistics


It will be found in the ensuing chapters that the normal distribution plays a very
predominant role in statistics. Of course, the central-limit theorem alone
ensures that this will be the case, but there are other almost equally important
reasons.

In the first place, many populations encountered in the course of research
in many fields seem to have a normal distribution to a good degree of approximation.
It has often been argued that this phenomenon is quite reasonable in
view of the central-limit theorem. We may consider the firing of a shot at a
target as an illustration. The course of the projectile is affected by a great
many factors, all admittedly with small effect. The net deviation is the net
effect of all these factors. Suppose that the effect of each factor is an observation
from some population; then the total effect is essentially the mean of a set
of observations from a set of populations. Being of the nature of means, the
actual observed deviations might therefore be expected to be approximately
normally distributed. We do not intend to imply here that most distributions
encountered in practice are normal, for such is not the case at all, but nearly
normal distributions are encountered quite frequently.

Another consideration which favors the normal distribution is the fact
that sampling distributions based on a parent normal distribution are fairly
manageable analytically. In making inferences about populations from
samples, it is necessary to have the distributions for various functions of the
sample observations. The mathematical problem of obtaining these distributions
is often easier for samples from a normal population than from any other,
and the remaining subsections of this section will be devoted to the problem of
finding the distributions of several different functions of a random sample from
a normally distributed population.

In applying statistical methods based on the normal distribution, the
experimenter must know, at least approximately, the general form of the distribution
function which his data follow. If it is normal, he may use the methods
directly; if it is not, he may sometimes transform his data so that the transformed
observations follow a normal distribution. When the experimenter does not
know the form of his population distribution, then he may use other more
general but usually less powerful methods of analysis called nonparametric
methods. Some of these methods will be presented in the final chapter of this
book.


4.2 Sample Mean

One of the simplest of all the possible functions of a random sample is the
sample mean, and for a random sample from a normal distribution the (exact)
distribution of the sample mean is also normal. This result first appeared
as a special case of Example 12 in Chap. V. It is repeated here.

Theorem 6 Let $\bar X_n$ denote the sample mean of a random sample of size n
from a normal distribution with mean $\mu$ and variance $\sigma^2$. Then $\bar X_n$ has
a normal distribution with mean $\mu$ and variance $\sigma^2/n$.

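PROOF To prove this theorem we shall use the moment-generating-function
technique.

\[
\begin{aligned}
m_{\bar X_n}(t) &= \mathscr{E}[\exp t\bar X_n] = \mathscr{E}\Big[\exp \frac{t\sum X_i}{n}\Big]
= \mathscr{E}\Big[\prod_{i=1}^{n} \exp \frac{tX_i}{n}\Big] = \prod_{i=1}^{n} \mathscr{E}\Big[\exp \frac{tX_i}{n}\Big] \\
&= \prod_{i=1}^{n} m_{X_i}\Big(\frac{t}{n}\Big) = \prod_{i=1}^{n} \exp\Big[\frac{\mu t}{n} + \frac12\Big(\frac{\sigma t}{n}\Big)^2\Big]
= \exp\Big[\mu t + \frac{\tfrac12(\sigma t)^2}{n}\Big],
\end{aligned}
\]

which is the moment generating function of a normal distribution with
mean $\mu$ and variance $\sigma^2/n$. ////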

Since we have the exact distribution of $\bar X_n$, in considering estimating $\mu$
with $\bar X_n$, we will be able to calculate, for instance, the (exact) probability that
our "estimator" $\bar X_n$ is within any fixed amount of the unknown parameter $\mu$.
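
To illustrate the kind of exact probability meant here: by Theorem 6, $P[|\bar X_n - \mu| < c] = 2\Phi(c\sqrt{n}/\sigma) - 1$, where $\Phi$ is the standard normal cumulative distribution function. The sketch below computes this with the error function from the Python standard library; the function name `prob_mean_within` is our own and is only illustrative.

```python
from math import erf, sqrt


def prob_mean_within(c: float, n: int, sigma: float) -> float:
    """P[|Xbar_n - mu| < c] when Xbar_n is normal with mean mu and variance sigma^2/n."""
    # 2*Phi(z) - 1 equals erf(z / sqrt(2)), with z = c * sqrt(n) / sigma.
    return erf(c * sqrt(n) / (sigma * sqrt(2.0)))


if __name__ == "__main__":
    # Probability that the mean of 25 observations (sigma = 2) lies within 0.5 of mu.
    print(prob_mean_within(0.5, 25, 2.0))
```

Note that the probability does not depend on the unknown $\mu$, only on c, n, and $\sigma$.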



4.3 The Chi-Square Distribution

The normal distribution has two unknown parameters $\mu$ and $\sigma^2$. In the previous
subsection we found the distribution of $\bar X_n$, which "estimates" the unknown $\mu$.
In this subsection, we seek the distribution of

\[
S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X_n)^2,
\]

which "estimates" the unknown $\sigma^2$. A density function which plays a central
role in the derivation of the distribution of $S^2$ is the chi-square density.

Definition 8 Chi-square distribution If X is a random variable with
density

\[
f_X(x) = \frac{1}{\Gamma(k/2)}\Big(\frac12\Big)^{k/2} x^{k/2-1} e^{-x/2}\, I_{(0,\infty)}(x), \tag{21}
\]

then X is defined to have a chi-square distribution with k degrees of freedom;
or the density given in Eq. (21) is called a chi-square density with k degrees
of freedom, where the parameter k, called the degrees of freedom, is a positive
integer. ////


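Remark We note that a chi-square density is a particular case of a
gamma density with gamma parameters r and $\lambda$ equal, respectively, to
k/2 and 1/2. Hence, if a random variable X has a chi-square distribution,

\[
\mathscr{E}[X] = \frac{k/2}{1/2} = k, \qquad \operatorname{var}[X] = \frac{k/2}{(1/2)^2} = 2k, \tag{22}
\]

and

\[
m_X(t) = \Big[\frac{1/2}{1/2 - t}\Big]^{k/2} = \Big[\frac{1}{1-2t}\Big]^{k/2}, \qquad t < \tfrac12. \tag{23}
\]

////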

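Theorem 7 If the random variables $X_i$, i = 1, 2, ..., k, are normally and
independently distributed with means $\mu_i$ and variances $\sigma_i^2$, then

\[
U = \sum_{i=1}^{k} \Big(\frac{X_i - \mu_i}{\sigma_i}\Big)^2
\]

has a chi-square distribution with k degrees of freedom.

PROOF Write $Z_i = (X_i - \mu_i)/\sigma_i$; then $Z_i$ has a standard normal
distribution. Now

\[
m_U(t) = \mathscr{E}[\exp tU] = \mathscr{E}\Big[\exp\Big(t\sum Z_i^2\Big)\Big]
= \mathscr{E}\Big[\prod_{i=1}^{k} \exp tZ_i^2\Big] = \prod_{i=1}^{k} \mathscr{E}[\exp tZ_i^2].
\]

But

\[
\begin{aligned}
\mathscr{E}[\exp tZ^2] &= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{tz^2} e^{-z^2/2}\, dz
= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-(1-2t)z^2/2}\, dz \\
&= \frac{1}{\sqrt{1-2t}} \int_{-\infty}^{\infty} \frac{\sqrt{1-2t}}{\sqrt{2\pi}}\, e^{-(1-2t)z^2/2}\, dz
= \frac{1}{\sqrt{1-2t}} \qquad \text{for } t < \tfrac12,
\end{aligned}
\]

the latter integral being unity since it represents the area under a normal
curve with variance 1/(1 - 2t). Hence,

\[
\prod_{i=1}^{k} \mathscr{E}[\exp tZ_i^2] = \prod_{i=1}^{k} \frac{1}{\sqrt{1-2t}} = \Big(\frac{1}{1-2t}\Big)^{k/2} \qquad \text{for } t < \tfrac12,
\]

the moment generating function of a chi-square distribution with k
degrees of freedom. ////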

Corollary If $X_1, \dots, X_n$ is a random sample from a normal distribution
with mean $\mu$ and variance $\sigma^2$, then $U = \sum_{i=1}^{n} (X_i - \mu)^2/\sigma^2$ has a chi-square
distribution with n degrees of freedom. ////

We might note that if either $\mu$ or $\sigma^2$ is unknown, the U in the above
corollary is not a statistic. On the other hand, if $\mu$ is known and $\sigma^2$ is unknown,
we could estimate $\sigma^2$ with $(1/n)\sum_{i=1}^{n} (X_i - \mu)^2$ [note that
$\mathscr{E}[(1/n)\sum (X_i - \mu)^2] = (1/n)\sum \mathscr{E}[(X_i - \mu)^2] = (1/n)\sum \sigma^2 = \sigma^2$],
and find the distribution of $(1/n)\sum_{i=1}^{n} (X_i - \mu)^2$ by using the corollary.

Remark In words, Theorem 7 says, "the sum of the squares of independent
standard normal random variables has a chi-square distribution
with degrees of freedom equal to the number of terms in the sum." ////

Theorem 8 If $Z_1, Z_2, \dots, Z_n$ is a random sample from a standard
normal distribution, then:

(i) $\bar Z$ has a normal distribution with mean 0 and variance 1/n.

(ii) $\bar Z$ and $\sum_{i=1}^{n} (Z_i - \bar Z)^2$ are independent.

(iii) $\sum_{i=1}^{n} (Z_i - \bar Z)^2$ has a chi-square distribution with n - 1 degrees
of freedom.


PROOF (Our proof will be incomplete.) (i) is a special case of
Theorem 6. We will prove (ii) for the case n = 2. If n = 2,

\[
\bar Z = \frac{Z_1 + Z_2}{2}
\]

and

\[
\sum (Z_i - \bar Z)^2 = \Big(Z_1 - \frac{Z_1+Z_2}{2}\Big)^2 + \Big(Z_2 - \frac{Z_1+Z_2}{2}\Big)^2
= \frac{(Z_1 - Z_2)^2}{4} + \frac{(Z_2 - Z_1)^2}{4} = \frac{(Z_2 - Z_1)^2}{2};
\]

so $\bar Z$ is a function of $Z_1 + Z_2$, and $\sum (Z_i - \bar Z)^2$ is a function of $Z_2 - Z_1$;
so to prove $\bar Z$ and $\sum (Z_i - \bar Z)^2$ are independent, it suffices to show that
$Z_1 + Z_2$ and $Z_2 - Z_1$ are independent. Now

\[
m_{Z_1+Z_2}(t_1) = \mathscr{E}[e^{t_1(Z_1+Z_2)}] = \mathscr{E}[e^{t_1 Z_1}]\,\mathscr{E}[e^{t_1 Z_2}]
= \exp \tfrac12 t_1^2 \exp \tfrac12 t_1^2 = \exp t_1^2,
\]

and, similarly,

\[
m_{Z_2-Z_1}(t_2) = \exp t_2^2.
\]

Also,

\[
\begin{aligned}
m_{Z_1+Z_2,\,Z_2-Z_1}(t_1, t_2) &= \mathscr{E}[e^{t_1(Z_1+Z_2) + t_2(Z_2-Z_1)}]
= \mathscr{E}[e^{(t_1-t_2)Z_1}]\,\mathscr{E}[e^{(t_1+t_2)Z_2}] \\
&= \exp \tfrac12 (t_1-t_2)^2 \exp \tfrac12 (t_1+t_2)^2 = \exp t_1^2 \exp t_2^2
= m_{Z_1+Z_2}(t_1)\, m_{Z_2-Z_1}(t_2);
\end{aligned}
\]

and since the joint moment generating function factors into the product
of the marginal moment generating functions, $Z_1 + Z_2$ and $Z_2 - Z_1$ are
independent.

To prove (iii), we accept the independence of $\bar Z$ and $\sum (Z_i - \bar Z)^2$ for
arbitrary n. Let us note that
$\sum Z_i^2 = \sum (Z_i - \bar Z + \bar Z)^2 = \sum (Z_i - \bar Z)^2 + 2\bar Z \sum (Z_i - \bar Z) + \sum \bar Z^2 = \sum (Z_i - \bar Z)^2 + n\bar Z^2$;
also $\sum (Z_i - \bar Z)^2$ and $n\bar Z^2$ are independent; hence

\[
m_{\sum Z_i^2}(t) = m_{\sum (Z_i - \bar Z)^2}(t)\, m_{n\bar Z^2}(t).
\]

So,

\[
m_{\sum (Z_i - \bar Z)^2}(t) = \frac{m_{\sum Z_i^2}(t)}{m_{n\bar Z^2}(t)}
= \frac{(1/(1-2t))^{n/2}}{(1/(1-2t))^{1/2}} = \Big(\frac{1}{1-2t}\Big)^{(n-1)/2}, \qquad t < \tfrac12,
\]

noting that $\sqrt{n}\,\bar Z$ has a standard normal distribution, implying that $n\bar Z^2$
has a chi-square distribution with one degree of freedom. We have
shown that the moment generating function of $\sum (Z_i - \bar Z)^2$ is that of a
chi-square distribution with n - 1 degrees of freedom, which completes
the proof. ////

Theorem 8 was stated for a random sample from a standard normal distribution,
whereas if we wish to make inferences about $\mu$ and $\sigma^2$, our sample
is from a normal distribution with mean $\mu$ and variance $\sigma^2$. (Let $X_1, \dots, X_n$
denote the sample from the normal distribution with mean $\mu$ and variance $\sigma^2$;
then the $Z_i$ of Theorem 8 could be taken equal to $(X_i - \mu)/\sigma$.

(i) of Theorem 8 becomes:

(i') $\bar Z = (1/n)\sum (X_i - \mu)/\sigma = (\bar X - \mu)/\sigma$ has a normal distribution with
mean 0 and variance 1/n.

(ii) of Theorem 8 becomes:

(ii') $\bar Z = (\bar X - \mu)/\sigma$ and $\sum (Z_i - \bar Z)^2 = \sum [(X_i - \mu)/\sigma - (\bar X - \mu)/\sigma]^2 = \sum [(X_i - \bar X)^2/\sigma^2]$
are independent, which implies $\bar X$ and $\sum (X_i - \bar X)^2$ are
independent.

(iii) of Theorem 8 becomes:

(iii') $\sum (Z_i - \bar Z)^2 = \sum [(X_i - \bar X)^2/\sigma^2]$ has a chi-square distribution with
n - 1 degrees of freedom.)


Corollary If $S^2 = [1/(n-1)] \sum_{i=1}^{n} (X_i - \bar X)^2$ is the sample variance of a
random sample from a normal distribution with mean $\mu$ and variance $\sigma^2$,
then

\[
U = \frac{(n-1)S^2}{\sigma^2} \tag{24}
\]

has a chi-square distribution with n - 1 degrees of freedom.

PROOF This is just (iii'). ////
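
The corollary can be checked informally by simulation. The sketch below (our own illustration; the function name `scaled_sample_variances` and the particular values of n, $\mu$, and $\sigma$ are arbitrary choices) draws many normal samples, forms $(n-1)S^2/\sigma^2$ for each, and compares the average and variance of the results with the chi-square values n - 1 and 2(n - 1) from Eq. (22).

```python
import random
from statistics import fmean, pvariance


def scaled_sample_variances(n, mu, sigma, reps, seed=0):
    """Draw reps normal samples of size n; return (n - 1) * S^2 / sigma^2 for each."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        sum_sq = sum((x - xbar) ** 2 for x in xs)  # equals (n - 1) * S^2
        values.append(sum_sq / sigma ** 2)
    return values


if __name__ == "__main__":
    u = scaled_sample_variances(n=5, mu=10.0, sigma=2.0, reps=20000)
    # For n = 5 the chi-square distribution has 4 degrees of freedom:
    # mean 4 and variance 8, up to simulation error.
    print("mean:", fmean(u), "variance:", pvariance(u))
```

A simulation of this kind only suggests agreement of the first two moments; it is not a substitute for the proof.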

Remark Since $S^2$ is a linear function of U in Eq. (24), the density of $S^2$
can be obtained from the density of U. It is

\[
f_{S^2}(y) = \Big(\frac{n-1}{2\sigma^2}\Big)^{(n-1)/2} \frac{1}{\Gamma[(n-1)/2]}\, y^{(n-3)/2} e^{-(n-1)y/2\sigma^2}\, I_{(0,\infty)}(y). \tag{25}
\]

////

Remark The phrase "degrees of freedom" can refer to the number of
independent squares in the sum. For example, the sum of Theorem 7 has
k independent squares, but the sum in (iii) of Theorem 8 has only n - 1
independent terms since the relation $\sum (Z_i - \bar Z) = 0$ enables one to compute
any one of the deviations $Z_i - \bar Z$, given the other n - 1 of them. ////

All the results of this section apply only to normal populations. In fact,
it can be proved that for no other distributions (i) are the sample mean and
sample variance independently distributed or (ii) is the sample mean exactly
normally distributed.


4.4 The F Distribution

A distribution, the F distribution, which we shall later find to be of considerable
practical interest, is the distribution of the ratio of two independent chi-square
random variables divided by their respective degrees of freedom. We suppose
that U and V are independently distributed with chi-square distributions with
m and n degrees of freedom, respectively. Their joint density is then [see
Eq. (21)]

\[
f_{U,V}(u, v) = \frac{1}{\Gamma(m/2)\Gamma(n/2)\,2^{(m+n)/2}}\, u^{(m-2)/2} v^{(n-2)/2} e^{-(u+v)/2}\, I_{(0,\infty)}(u)\, I_{(0,\infty)}(v). \tag{26}
\]

We shall find the distribution of the quantity

\[
X = \frac{U/m}{V/n}, \tag{27}
\]

which is sometimes referred to as the variance ratio. To find the distribution
of X, we make the transformation X = (U/m)/(V/n) and Y = V, obtain the
joint distribution of X and Y, and then get the marginal distribution of X by
integrating out the y variable. The Jacobian of the transformation is (m/n)y; so

\[
f_{X,Y}(x, y) = \frac{1}{\Gamma(m/2)\Gamma(n/2)\,2^{(m+n)/2}}\, \frac{m}{n}\, y \Big(\frac{m}{n}xy\Big)^{(m-2)/2} y^{(n-2)/2} e^{-\frac12[(m/n)xy + y]}\, I_{(0,\infty)}(x)\, I_{(0,\infty)}(y),
\]

and

\[
\begin{aligned}
f_X(x) &= \int_0^{\infty} f_{X,Y}(x, y)\, dy \\
&= \frac{1}{\Gamma(m/2)\Gamma(n/2)\,2^{(m+n)/2}} \Big(\frac{m}{n}\Big)^{m/2} x^{(m-2)/2} \int_0^{\infty} y^{(m+n-2)/2} e^{-\frac12[(m/n)x + 1]y}\, dy \\
&= \frac{\Gamma[(m+n)/2]}{\Gamma(m/2)\Gamma(n/2)} \Big(\frac{m}{n}\Big)^{m/2} \frac{x^{(m-2)/2}}{[1 + (m/n)x]^{(m+n)/2}}\, I_{(0,\infty)}(x). \tag{28}
\end{aligned}
\]

Definition 9 F distribution If X is a random variable having density
given by Eq. (28), then X is defined to be an F-distributed random variable
with degrees of freedom m and n. ////

The order in which the degrees of freedom are given is important since the 
density of the F distribution is not symmetrical in m and n. The number of 
degrees of freedom of the numerator of the ratio m/n that appears in Eq. (28) 
is always quoted first. Or if the F-distributed random variable is a ratio of two 
independent chi-square-distributed random variables divided by their respective 
degrees of freedom, as in the derivation above, then the degrees of freedom of the 
chi-square random variable that appears in the numerator are always quoted 
first. 

We have proved the following theorem. 


Theorem 9 Let U be a chi-square random variable with m degrees of
freedom; let V be a chi-square random variable with n degrees of freedom,
and let U and V be independent. Then the random variable

\[
X = \frac{U/m}{V/n}
\]

is distributed as an F distribution with m and n degrees of freedom. The
density of X is given in Eq. (28). ////

The following corollary shows how the result of Theorem 9 can be useful 
in sampling. 


Corollary If $X_1, \dots, X_{m+1}$ is a random sample of size m + 1 from a
normal population with mean $\mu_X$ and variance $\sigma^2$, if $Y_1, \dots, Y_{n+1}$ is a
random sample of size n + 1 from a normal population with mean $\mu_Y$ and
variance $\sigma^2$, and if the two samples are independent, then it follows
that $(1/\sigma^2) \sum_{i=1}^{m+1} (X_i - \bar X)^2$ is chi-square-distributed with m degrees of
freedom, and $(1/\sigma^2) \sum_{j=1}^{n+1} (Y_j - \bar Y)^2$ is chi-square-distributed with n degrees
of freedom; so that the statistic

\[
\frac{\sum_{i=1}^{m+1} (X_i - \bar X)^2 / m}{\sum_{j=1}^{n+1} (Y_j - \bar Y)^2 / n}
\]

has an F distribution with m and n degrees of freedom. ////

We close this subsection with several further remarks about the F distribution.

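Remark If X is an F-distributed random variable with m and n degrees
of freedom, then

\[
\mathscr{E}[X] = \frac{n}{n-2} \quad \text{for } n > 2 \qquad \text{and} \qquad
\operatorname{var}[X] = \frac{2n^2(m+n-2)}{m(n-2)^2(n-4)} \quad \text{for } n > 4. \tag{29}
\]

PROOF At first it might be surprising that the mean depends only
on the degrees of freedom of the denominator. Write X as in Eq. (27),
that is,

\[
X = \frac{U/m}{V/n}; \qquad \text{then} \qquad \mathscr{E}[X] = \frac{n}{m}\, \mathscr{E}[U]\, \mathscr{E}\Big[\frac{1}{V}\Big].
\]

But $\mathscr{E}[U] = m$ by Eq. (22), and

\[
\mathscr{E}\Big[\frac{1}{V}\Big] = \frac{1}{\Gamma(n/2)\,2^{n/2}} \int_0^{\infty} \frac{1}{v}\, v^{(n-2)/2} e^{-v/2}\, dv
= \frac{\Gamma[(n-2)/2]\; 2^{(n-2)/2}}{\Gamma(n/2)\; 2^{n/2}} = \frac{1}{n-2};
\]

and so

\[
\mathscr{E}[X] = \frac{n}{m} \cdot m \cdot \frac{1}{n-2} = \frac{n}{n-2}.
\]

The variance formula is similarly derived. ////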
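
The fact that the mean of an F-distributed random variable depends only on the denominator degrees of freedom can be illustrated by simulation. The sketch below is our own illustration (the names `f_mean` and `simulate_f` are hypothetical). It generates the chi-square variables through `random.gammavariate`, using the fact noted after Definition 8 that a chi-square density with k degrees of freedom is a gamma density with r = k/2 and $\lambda$ = 1/2, that is, with scale $1/\lambda = 2$.

```python
import random


def f_mean(m: int, n: int) -> float:
    """Mean n/(n - 2) of an F variable with m and n degrees of freedom; needs n > 2."""
    if n <= 2:
        raise ValueError("the mean exists only for n > 2")
    return n / (n - 2)


def simulate_f(m: int, n: int, reps: int, seed: int = 0) -> list:
    """Simulate (U/m)/(V/n) for independent chi-square U (m d.f.) and V (n d.f.)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        # gammavariate(shape, scale): chi-square with k d.f. has shape k/2 and scale 2.
        u = rng.gammavariate(m / 2.0, 2.0)
        v = rng.gammavariate(n / 2.0, 2.0)
        draws.append((u / m) / (v / n))
    return draws


if __name__ == "__main__":
    for m in (4, 40):
        draws = simulate_f(m, 10, 40000)
        print(m, sum(draws) / len(draws), f_mean(m, 10))
```

With n = 10 both simulated averages should be close to 10/8 = 1.25, whatever the value of m.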


Remark If X has an F distribution with m and n degrees of freedom,
then 1/X has an F distribution with n and m degrees of freedom. This
result allows one to table the F distribution for the upper tail only. For
example, if the quantile $\xi_{.95}$ is given for an F distribution with m and n
degrees of freedom, then the quantile $\xi'_{.05}$ for an F distribution with n and
m degrees of freedom is given by $1/\xi_{.95}$. In general, if X has an F distribution
with m and n degrees of freedom and Y has an F distribution
with n and m degrees of freedom, then the pth quantile point of X, $\xi_p$,
is the reciprocal of the (1 - p)th quantile point of Y, $\xi'_{1-p}$, as the following
shows:

\[
p = P[X \le \xi_p] = P\Big[\frac{1}{X} \ge \frac{1}{\xi_p}\Big] = P\Big[Y \ge \frac{1}{\xi_p}\Big] = 1 - P\Big[Y \le \frac{1}{\xi_p}\Big];
\]

but

\[
1 - p = P[Y \le \xi'_{1-p}],
\]

so

\[
\xi'_{1-p} = \frac{1}{\xi_p}.
\]

////

Remark If X is an F-distributed random variable with m and n degrees
of freedom, then

\[
W = \frac{mX/n}{1 + mX/n}
\]

has a beta density with parameters a = m/2 and b = n/2. ////


4.5 Student's t Distribution 


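Another distribution of considerable practical importance is that of the ratio of
a standard normally distributed random variable to the square root of an
independently distributed chi-square random variable divided by its degrees of
freedom. That is, if Z has a standard normal distribution, if U has a chi-square
distribution with k degrees of freedom, and if Z and U are independent,
we seek the distribution of

\[
X = \frac{Z}{\sqrt{U/k}}.
\]

The joint density of Z and U is given by

\[
f_{Z,U}(z, u) = \frac{1}{\sqrt{2\pi}}\, \frac{1}{\Gamma(k/2)} \Big(\frac12\Big)^{k/2} u^{(k/2)-1} e^{-u/2} e^{-z^2/2}\, I_{(0,\infty)}(u). \tag{30}
\]

If we make the transformation $X = Z/\sqrt{U/k}$ and Y = U, the Jacobian is
$\sqrt{y/k}$, and so

\[
f_{X,Y}(x, y) = \sqrt{\frac{y}{k}}\, \frac{1}{\sqrt{2\pi}}\, \frac{1}{\Gamma(k/2)} \Big(\frac12\Big)^{k/2} y^{(k/2)-1} e^{-y/2} e^{-x^2 y/2k}\, I_{(0,\infty)}(y)
\]

and

\[
\begin{aligned}
f_X(x) &= \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy
= \frac{1}{\sqrt{2k\pi}}\, \frac{1}{\Gamma(k/2)} \Big(\frac12\Big)^{k/2} \int_0^{\infty} y^{k/2 - 1 + 1/2} e^{-\frac12(1 + x^2/k)y}\, dy \\
&= \frac{\Gamma[(k+1)/2]}{\Gamma(k/2)}\, \frac{1}{\sqrt{k\pi}}\, \frac{1}{(1 + x^2/k)^{(k+1)/2}}. \tag{31}
\end{aligned}
\]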


Definition 10 Student's t distribution If X is a random variable having
density given by Eq. (31), then X is defined to have a Student's t distribution,
or the density given in Eq. (31) is called a Student's t distribution
with k degrees of freedom. ////

We have derived the following result.

Theorem 10 If Z has a standard normal distribution, if U has a chi-square
distribution with k degrees of freedom, and if Z and U are independent,
then $Z/\sqrt{U/k}$ has a Student's t distribution with k degrees of
freedom. ////


The following corollary shows how the result of Theorem 10 is applicable
to sampling from a normal population.


Corollary If $X_1, \dots, X_n$ is a random sample from a normal distribution
with mean $\mu$ and variance $\sigma^2$, then $Z = (\bar X - \mu)/(\sigma/\sqrt{n})$ has a standard
normal distribution and $U = (1/\sigma^2) \sum (X_i - \bar X)^2$ has a chi-square distribution
with n - 1 degrees of freedom. Furthermore, Z and U are independent
(see Theorem 8); hence

\[
\frac{(\bar X - \mu)/(\sigma/\sqrt{n})}{\sqrt{(1/\sigma^2) \sum (X_i - \bar X)^2/(n-1)}}
= \frac{\sqrt{n(n-1)}\,(\bar X - \mu)}{\sqrt{\sum (X_i - \bar X)^2}}
\]

has a Student's t distribution with n - 1 degrees of freedom. ////

We might note that for one degree of freedom the Student's t distribution
reduces to a Cauchy distribution; and as the number of degrees of freedom
increases, the Student's t distribution approaches the standard normal distribution.
Also, the square of a Student's t-distributed random variable with k
degrees of freedom has an F distribution with 1 and k degrees of freedom.

Remark If X is a random variable having a Student's t distribution with
k degrees of freedom, then

\[
\mathscr{E}[X] = 0 \ \text{ if } k > 1 \qquad \text{and} \qquad \operatorname{var}[X] = k/(k-2) \ \text{ if } k > 2. \tag{32}
\]

PROOF The first two moments of X can be found by writing
$X = Z/\sqrt{U/k}$ as in Theorem 10 and using the independence of Z and U.
The actual derivation is left as an exercise. ////
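
The two limiting cases mentioned above (Cauchy for one degree of freedom, standard normal for many) can be checked numerically from Eq. (31). The following sketch is our own illustration; the helper names are hypothetical, and the Cauchy density used is the standard one, $1/[\pi(1 + x^2)]$.

```python
from math import exp, lgamma, log, log1p, pi, sqrt


def t_density(x: float, k: int) -> float:
    """Student's t density with k degrees of freedom, Eq. (31)."""
    # Work on the log scale so that large k does not overflow the gamma function.
    log_const = lgamma((k + 1) / 2.0) - lgamma(k / 2.0) - 0.5 * log(k * pi)
    return exp(log_const - (k + 1) / 2.0 * log1p(x * x / k))


def cauchy_density(x: float) -> float:
    """Standard Cauchy density."""
    return 1.0 / (pi * (1.0 + x * x))


def normal_density(x: float) -> float:
    """Standard normal density."""
    return exp(-x * x / 2.0) / sqrt(2.0 * pi)


if __name__ == "__main__":
    print(t_density(0.7, 1), cauchy_density(0.7))
    for k in (5, 50, 500):
        print(k, t_density(1.0, k), normal_density(1.0))
```

The first line of output shows agreement with the Cauchy density for k = 1; the remaining lines show the t density at x = 1 moving toward the normal value as k grows.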

This completes Sec. 4 on sampling from the normal distribution. Note
that we considered the distribution of functions of only two different statistics,
namely, the sample mean and sample variance. In the next chapter we will find
that these two statistics are the only ones of interest in sampling from a normal
distribution; they will turn out to be sufficient statistics.


5 ORDER STATISTICS 


5.1 Definition and Distributions 

In Subsec. 2.4 we defined what we meant by statistic and then gave the sample
moments as examples of easy-to-understand statistics. In this section the
concept of order statistics will be defined, and some of their properties will be
investigated. Order statistics, like sample moments, play an important role
in statistical inference. Order statistics are to population quantiles as sample
moments are to population moments.

Definition 11 Order statistics Let $X_1, X_2, \dots, X_n$ denote a random
sample of size n from a cumulative distribution function $F(\cdot)$. Then
$Y_1 \le Y_2 \le \cdots \le Y_n$, where the $Y_i$ are the $X_i$ arranged in order of increasing
magnitudes, are defined to be the order statistics corresponding to
the random sample $X_1, \dots, X_n$. ////

We note that the $Y_i$ are statistics (they are functions of the random sample
$X_1, X_2, \dots, X_n$) and are in order. Unlike the random sample itself, the order
statistics are clearly not independent, for if $Y_j \ge y$, then $Y_{j+1} \ge y$.

We seek the distribution, both marginal and joint, of the order statistics.
We have already found the marginal distributions of $Y_1 = \min[X_1, \dots, X_n]$ and
$Y_n = \max[X_1, \dots, X_n]$ in Chap. V. Now we will find the marginal cumulative
distribution of an arbitrary order statistic.

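Theorem 11 Let $Y_1 \le Y_2 \le \cdots \le Y_n$ represent the order statistics from
a cumulative distribution function $F(\cdot)$. The marginal cumulative distribution
function of $Y_\alpha$, $\alpha = 1, 2, \dots, n$, is given by

\[
F_{Y_\alpha}(y) = \sum_{j=\alpha}^{n} \binom{n}{j} [F(y)]^{j} [1 - F(y)]^{n-j}. \tag{33}
\]

PROOF For fixed y, let

\[
Z_i = I_{(-\infty,\,y]}(X_i);
\]

then

\[
\sum_{i=1}^{n} Z_i = \text{the number of } X_i \le y.
\]

Note that $\sum Z_i$ has a binomial distribution with parameters n and F(y).
Now

\[
F_{Y_\alpha}(y) = P[Y_\alpha \le y] = P\Big[\sum Z_i \ge \alpha\Big] = \sum_{j=\alpha}^{n} \binom{n}{j} [F(y)]^{j} [1 - F(y)]^{n-j}.
\]

The key step in the proof is the equivalence of the two events $\{Y_\alpha \le y\}$
and $\{\sum Z_i \ge \alpha\}$. If the $\alpha$th order statistic is less than or equal to y, then
surely the number of $X_i$ less than or equal to y is greater than or equal to
$\alpha$, and conversely. ////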


Corollary
\[
F_{Y_n}(y) = \sum_{j=n}^{n} \binom{n}{j} [F(y)]^{j} [1 - F(y)]^{n-j} = [F(y)]^{n},
\]
and
\[
F_{Y_1}(y) = \sum_{j=1}^{n} \binom{n}{j} [F(y)]^{j} [1 - F(y)]^{n-j} = 1 - [1 - F(y)]^{n}.
\]
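
Equation (33) is a binomial tail sum and is easy to evaluate directly. The sketch below (our own illustration; the name `order_stat_cdf` is hypothetical) takes the value F(y) as an argument, so it applies to any sampled distribution, and confirms the two special cases of the corollary.

```python
from math import comb


def order_stat_cdf(alpha: int, n: int, F_y: float) -> float:
    """P[Y_alpha <= y] from Eq. (33), given F_y = F(y) for the sampled distribution."""
    return sum(comb(n, j) * F_y ** j * (1.0 - F_y) ** (n - j) for j in range(alpha, n + 1))


if __name__ == "__main__":
    n, F_y = 7, 0.3
    print(order_stat_cdf(n, n, F_y), F_y ** n)                 # maximum
    print(order_stat_cdf(1, n, F_y), 1 - (1 - F_y) ** n)       # minimum
    print(order_stat_cdf(2, 3, 0.5))                           # median of three at F(y) = 1/2
```

For a sample of size 3, the last line gives the probability that the middle order statistic falls at or below the population median, which is (3 + 1)/8 = 1/2 by symmetry.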
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Theorem 1 1 gives the marginal distribution of an individual order statistic 
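in terms of the cumulative distribution function $F(\cdot)$. For the remainder of
this subsection, we will assume that our random sample $X_1, \dots, X_n$ came from a
probability density function $f(\cdot)$; that is, we assume that the random variables
$X_i$ are continuous. We seek the density of $Y_\alpha$, which, of course, could be
obtained from Eq. (33) by differentiation of $F_{Y_\alpha}(y)$. Note that

\[
\begin{aligned}
f_{Y_\alpha}(y) &= \lim_{\Delta y \to 0} \frac{F_{Y_\alpha}(y + \Delta y) - F_{Y_\alpha}(y)}{\Delta y}
= \lim_{\Delta y \to 0} \frac{P[y < Y_\alpha \le y + \Delta y]}{\Delta y} \\
&= \lim_{\Delta y \to 0} \frac{P[(\alpha - 1) \text{ of the } X_i \le y;\ \text{one } X_i \text{ in } (y, y + \Delta y];\ (n - \alpha) \text{ of the } X_i > y + \Delta y]}{\Delta y} \\
&= \lim_{\Delta y \to 0} \Big\{ \frac{n!}{(\alpha-1)!\,1!\,(n-\alpha)!}\, \frac{[F(y)]^{\alpha-1} [F(y + \Delta y) - F(y)] [1 - F(y + \Delta y)]^{n-\alpha}}{\Delta y} \Big\} \\
&= \frac{n!}{(\alpha-1)!\,(n-\alpha)!}\, [F(y)]^{\alpha-1} [1 - F(y)]^{n-\alpha} f(y).
\end{aligned}
\]

We have made sensible use of the multinomial distribution. Similarly, we can
derive the joint density of $Y_\alpha$ and $Y_\beta$ for $1 \le \alpha < \beta \le n$.

\[
\begin{aligned}
f_{Y_\alpha, Y_\beta}(x, y)\, \Delta x\, \Delta y &\approx P[x < Y_\alpha \le x + \Delta x;\ y < Y_\beta \le y + \Delta y] \\
&\approx P[(\alpha - 1) \text{ of the } X_i \le x;\ \text{one } X_i \text{ in } (x, x + \Delta x]; \\
&\qquad (\beta - \alpha - 1) \text{ of the } X_i \text{ in } (x + \Delta x, y]; \\
&\qquad \text{one } X_i \text{ in } (y, y + \Delta y];\ (n - \beta) \text{ of the } X_i > y + \Delta y] \\
&\approx \frac{n!}{(\alpha-1)!\,1!\,(\beta-\alpha-1)!\,1!\,(n-\beta)!} \\
&\qquad \times [F(x)]^{\alpha-1} [F(y) - F(x + \Delta x)]^{\beta-\alpha-1} [1 - F(y + \Delta y)]^{n-\beta} f(x)\, \Delta x\, f(y)\, \Delta y;
\end{aligned}
\]

hence

\[
f_{Y_\alpha, Y_\beta}(x, y) = \frac{n!}{(\alpha-1)!\,(\beta-\alpha-1)!\,(n-\beta)!}\, [F(x)]^{\alpha-1} [F(y) - F(x)]^{\beta-\alpha-1} [1 - F(y)]^{n-\beta} f(x) f(y)
\]

for x < y, and

\[
f_{Y_\alpha, Y_\beta}(x, y) = 0 \qquad \text{for } x \ge y.
\]

In general,

\[
\begin{aligned}
f_{Y_1, \dots, Y_n}(y_1, \dots, y_n) &= \lim_{\Delta y_i \to 0} \frac{1}{\prod_{i=1}^{n} \Delta y_i}\, P[y_1 < Y_1 \le y_1 + \Delta y_1;\ \dots;\ y_n < Y_n \le y_n + \Delta y_n] \\
&= \lim_{\Delta y_i \to 0} \frac{1}{\prod_{i=1}^{n} \Delta y_i}\, P[\text{one } X_i \text{ in } (y_1, y_1 + \Delta y_1];\ \dots;\ \text{one } X_i \text{ in } (y_n, y_n + \Delta y_n]] \\
&= \lim_{\Delta y_i \to 0} \frac{n!}{\prod_{i=1}^{n} \Delta y_i}\, [F(y_1 + \Delta y_1) - F(y_1)] \cdots [F(y_n + \Delta y_n) - F(y_n)] \\
&= n!\, f(y_1) \cdots f(y_n) \qquad \text{for } y_1 < y_2 < \cdots < y_n,
\end{aligned}
\]

and $f_{Y_1, \dots, Y_n}(y_1, \dots, y_n) = 0$ otherwise.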

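We have derived the following theorem.


Theorem 12 Let $X_1, X_2, \dots, X_n$ be a random sample from the probability
density function $f(\cdot)$ with cumulative distribution function $F(\cdot)$.
Let $Y_1 \le Y_2 \le \cdots \le Y_n$ denote the corresponding order statistics; then

\[
f_{Y_\alpha}(y) = \frac{n!}{(\alpha-1)!\,(n-\alpha)!}\, [F(y)]^{\alpha-1} [1 - F(y)]^{n-\alpha} f(y); \tag{34}
\]

\[
f_{Y_\alpha, Y_\beta}(x, y) = \frac{n!}{(\alpha-1)!\,(\beta-\alpha-1)!\,(n-\beta)!}\, [F(x)]^{\alpha-1} [F(y) - F(x)]^{\beta-\alpha-1} [1 - F(y)]^{n-\beta} f(x) f(y)\, I_{(x,\infty)}(y); \tag{35}
\]

\[
f_{Y_1, \dots, Y_n}(y_1, \dots, y_n) = \begin{cases} n!\, f(y_1) \cdots f(y_n) & \text{for } y_1 < y_2 < \cdots < y_n \\ 0 & \text{otherwise.} \end{cases} \tag{36}
\]

////

Any set of marginal densities can be obtained from the joint density
$f_{Y_1, \dots, Y_n}(y_1, \dots, y_n)$ by simply integrating out the unwanted variables.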


5.2 Distribution of Functions of Order Statistics 

In the previous subsection we derived the joint and marginal distributions of the
order statistics themselves. In this subsection we will find the distribution of
certain functions of the order statistics. One possible function of the order
statistics is their arithmetic mean, equal to

\[
\frac{1}{n} \sum_{j=1}^{n} Y_j.
\]

Note, however, that $(1/n) \sum_{j=1}^{n} Y_j = (1/n) \sum_{i=1}^{n} X_i$, the sample mean, which was the
subject of Sec. 3 of this chapter. We define now some other functions of the
order statistics.


Definition 12 Sample median, sample range, sample midrange Let
$Y_1 \le \cdots \le Y_n$ denote the order statistics of a random sample $X_1, \dots, X_n$
from a density $f(\cdot)$. The sample median is defined to be the middle order
statistic if n is odd and the average of the middle two order statistics if n
is even. The sample range is defined to be $Y_n - Y_1$, and the sample midrange
is defined to be $(Y_1 + Y_n)/2$. ////

If the sample size is odd, then the distribution of the sample median is
given by Eq. (34); for example, if n = 2k + 1, where k is some positive integer,
then $Y_{k+1}$ is the sample median whose distribution is given by Eq. (34). If the
sample size is even, say n = 2k, then the sample median is $(Y_k + Y_{k+1})/2$, the
distribution of which can be obtained by a transformation starting with the
joint density of $Y_k$ and $Y_{k+1}$, which is given by Eq. (35).

We derive now the joint distribution of the sample range and midrange, 
from which the marginals can be obtained. 

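By Eq. (35), we have

\[
f_{Y_1, Y_n}(x, y) = n(n-1) [F(y) - F(x)]^{n-2} f(x) f(y) \qquad \text{for } x < y. \tag{37}
\]

Make the transformation $R = Y_n - Y_1$ and $T = (Y_1 + Y_n)/2$, or r = y - x and
t = (x + y)/2. Now x = t - r/2, and y = t + r/2; hence

\[
J = \begin{vmatrix} \dfrac{\partial x}{\partial r} & \dfrac{\partial x}{\partial t} \\[2mm] \dfrac{\partial y}{\partial r} & \dfrac{\partial y}{\partial t} \end{vmatrix}
= \begin{vmatrix} -\tfrac12 & 1 \\ \tfrac12 & 1 \end{vmatrix} = -1,
\]

and we obtain Theorem 13.

Theorem 13 If R is the sample range and T the sample midrange from
a probability density function, then their joint distribution is given by

\[
f_{R,T}(r, t) = n(n-1) [F(t + r/2) - F(t - r/2)]^{n-2} f(t - r/2) f(t + r/2) \qquad \text{for } r > 0, \tag{38}
\]

and the marginal distributions are given by

\[
f_R(r) = \int_{-\infty}^{\infty} f_{R,T}(r, t)\, dt \qquad \text{and} \qquad f_T(t) = \int_{0}^{\infty} f_{R,T}(r, t)\, dr. \tag{39}
\]

////

EXAMPLE 6 Let $X_1, \dots, X_n$ be a random sample from a uniform distribution
on $(\mu - \sqrt{3}\sigma,\ \mu + \sqrt{3}\sigma)$. Here $\mu$ is the mean, and $\sigma^2$ is the variance
of the sampled population.

\[
\begin{aligned}
f_{R,T}(r, t) &= \frac{n(n-1)\, r^{n-2}}{(2\sqrt{3}\sigma)^{n}} \qquad \text{for } \mu - \sqrt{3}\sigma < t - r/2 < t + r/2 < \mu + \sqrt{3}\sigma \\
&= \frac{n(n-1)\, r^{n-2}}{(2\sqrt{3}\sigma)^{n}}\, I_{(\mu - \sqrt{3}\sigma + r/2,\ \mu + \sqrt{3}\sigma - r/2)}(t)\, I_{(0,\, 2\sqrt{3}\sigma)}(r). \tag{40}
\end{aligned}
\]

\[
f_R(r) = \int f_{R,T}(r, t)\, dt = \frac{n(n-1)\, r^{n-2}}{(2\sqrt{3}\sigma)^{n}}\, (2\sqrt{3}\sigma - r)\, I_{(0,\, 2\sqrt{3}\sigma)}(r). \tag{41}
\]

We note that $f_R(r)$ is independent of the parameter $\mu$.

\[
f_T(t) = \int f_{R,T}(r, t)\, dr = \frac{n(n-1)}{(2\sqrt{3}\sigma)^{n}} \int_0^{\min[2(t - \mu + \sqrt{3}\sigma),\ 2(\mu + \sqrt{3}\sigma - t)]} r^{n-2}\, dr,
\]

which simplifies to

\[
f_T(t) = \frac{n}{2\sqrt{3}\sigma} \Big(1 + \frac{t - \mu}{\sqrt{3}\sigma}\Big)^{n-1} I_{(\mu - \sqrt{3}\sigma,\ \mu)}(t)
+ \frac{n}{2\sqrt{3}\sigma} \Big(1 - \frac{t - \mu}{\sqrt{3}\sigma}\Big)^{n-1} I_{[\mu,\ \mu + \sqrt{3}\sigma)}(t). \tag{42}
\]

From Eq. (41) we can derive $\mathscr{E}[R] = 2\sqrt{3}\sigma(n-1)/(n+1)$. ////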
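
The expected range of Example 6, and the remark that the distribution of R does not involve $\mu$, can be illustrated by simulation. The sketch below is our own; the function names and the choice n = 5, $\sigma$ = 1 are arbitrary.

```python
import random
from math import sqrt
from statistics import fmean


def expected_range(n: int, sigma: float) -> float:
    """E[R] = 2*sqrt(3)*sigma*(n - 1)/(n + 1) for the uniform density of Example 6."""
    return 2.0 * sqrt(3.0) * sigma * (n - 1) / (n + 1)


def simulate_ranges(n, mu, sigma, reps, seed=0):
    """Sample ranges Y_n - Y_1 from uniform(mu - sqrt(3)*sigma, mu + sqrt(3)*sigma)."""
    rng = random.Random(seed)
    half_width = sqrt(3.0) * sigma
    ranges = []
    for _ in range(reps):
        xs = [rng.uniform(mu - half_width, mu + half_width) for _ in range(n)]
        ranges.append(max(xs) - min(xs))
    return ranges


if __name__ == "__main__":
    for mu in (0.0, 100.0):
        print(mu, fmean(simulate_ranges(5, mu, 1.0, 20000)), expected_range(5, 1.0))
```

With the same seed, changing $\mu$ only shifts every observation by a constant, so the simulated ranges are the same up to rounding error.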


Certain functions of the order statistics are again statistics and may be used
to make statistical inferences. For example, both the sample median and the
midrange can be used to estimate $\mu$, the mean of the population. For the uniform
density given in the above example, the variances of the sample mean, the
sample median, and the sample midrange are compared in Problem 33.


5.3 Asymptotic Distributions

In Subsec. 3.3, we discussed the asymptotic distribution of the sample mean
$\bar X_n$. We saw that $\bar X_n$ was asymptotically normally distributed with mean $\mu$ and
variance $\sigma^2/n$. We now consider the question: Is there an asymptotic distribution
for the sample median? We will state (without proof) a more general
result.

Since for asymptotic results the sample size n increases, we let
$Y_1^{(n)} \le Y_2^{(n)} \le \cdots \le Y_n^{(n)}$ denote the order statistics for a sample of size n. The
superscript denotes the sample size. We will give the asymptotic distribution of that
order statistic which is approximately the (np)th order statistic for a sample of
size n for any 0 < p < 1. We say "approximately" the (np)th order statistic
because np may not be an integer. Define $p_n$ to be such that $np_n$ is an integer
and $p_n$ is approximately equal to p; then $Y_{np_n}^{(n)}$ is the $(np_n)$th order statistic for a
sample of size n. (If $X_1, \dots, X_n$ are independent for each positive integer n, we
will say $X_1, \dots, X_n, \dots$ are independent.)

Theorem 14 Let $X_1, \dots, X_n, \dots$ be independent identically distributed
random variables with common probability density $f(\cdot)$ and cumulative
distribution function $F(\cdot)$. Assume that F(x) is strictly monotone for
0 < F(x) < 1. Let $\xi_p$ be the unique solution in x of F(x) = p for some
0 < p < 1. ($\xi_p$ is the pth quantile.) Let $p_n$ be such that $np_n$ is an integer
and $n|p_n - p|$ is bounded. Finally, let $Y_{np_n}^{(n)}$ denote the $(np_n)$th order
statistic for a random sample of size n. Then $Y_{np_n}^{(n)}$ is asymptotically
distributed as a normal distribution with mean $\xi_p$ and variance
$p(1-p)/n[f(\xi_p)]^2$. ////

EXAMPLE 7 Let p = 1/2; then $\xi_p$ is the population median, and Theorem 14
states that the sample median is asymptotically distributed as a normal
distribution with mean the population median and variance $1/4n[f(\xi_{1/2})]^2$.
In particular, if $f(\cdot)$ is a normal density with mean $\mu$ and variance $\sigma^2$,
then the sample median is asymptotically normally distributed with
mean $\mu$ and variance $1/4n[f(\mu)]^2 = \pi\sigma^2/2n$. Recall that the sample mean
is normally distributed with mean $\mu$ and variance $\sigma^2/n$. ////
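
Example 7 implies that, for normal samples, the variance of the sample median is asymptotically $\pi/2 \approx 1.57$ times the variance of the sample mean. The following simulation sketch is our own illustration; the function name and the choices n = 101 and 4000 replications are arbitrary. It compares the two variances empirically.

```python
import random
from math import pi
from statistics import fmean, median, pvariance


def mean_and_median_variances(n: int, reps: int, seed: int = 0):
    """Empirical variances of the sample mean and sample median for standard normal samples."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(fmean(xs))
        medians.append(median(xs))
    return pvariance(means), pvariance(medians)


if __name__ == "__main__":
    n = 101
    v_mean, v_median = mean_and_median_variances(n, 4000)
    print("n * var(mean):   ", n * v_mean, "(theory: 1)")
    print("n * var(median): ", n * v_median, "(asymptotic theory:", pi / 2, ")")
    print("ratio:", v_median / v_mean)
```

Because Theorem 14 is an asymptotic statement, the simulated ratio for a finite n need not equal $\pi/2$ exactly, even apart from simulation error.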

In Theorem 14 above we considered a certain kind of asymptotic distribution
of order statistics. We will now consider yet another kind. In the above
we looked at the asymptotic distribution of that order statistic which was
approximately the (np)th order statistic for a sample of size n. Such an order
statistic had (approximately) 100p percent of the n observations to its left.
That is, its relative position remained unchanged as n, the sample size,
increased; it always had (approximately) the same percentage of the n observations
to its left. We will now consider the asymptotic distribution of that order
statistic whose absolute position remains unchanged. That is, we consider
the asymptotic distribution of, say, $Y_k^{(n)}$ for fixed k and increasing n.
$Y_k^{(n)}$ is the kth smallest order statistic for a sample of size $n \ge k$, and k remains
fixed. In order to make the presentation somewhat simpler, we will take k = 1,
in which case $Y_1^{(n)}$ is the smallest of the n observations. We note that we could
just as well consider the kth largest order statistic, namely $Y_{n-k+1}^{(n)}$, which for
k = 1 specializes to $Y_n^{(n)}$, the largest order statistic for a sample of size n. Either
$Y_1^{(n)}$ or $Y_n^{(n)}$ is often referred to as an extreme-value statistic.

Practical applications of extreme-value statistics are many. The old
adage that a chain is no stronger than its weakest link provides a simple example.
If $X_i$ denotes the "strength" of the ith link of a chain with n similar links, then
$Y_1^{(n)} = \min[X_1, \dots, X_n]$ is the "strength" of the chain. Also, in measuring the
results of certain physical phenomena such as floods, droughts, earthquakes,
winds, temperatures, etc., it can be seen that under certain circumstances one is
more interested in extreme values than in average values. For instance, it is
the extreme earthquake or flood, and not the average earthquake or flood, that
is more damaging. We can see that results, whether exact or asymptotic, for
extreme-value statistics can be just as important as results for averages.

For the most part we will concentrate on finding the asymptotic distribution
of $Y_n^{(n)}$. One might wonder why we should be interested in an asymptotic
distribution of $Y_n^{(n)}$ when the exact distribution, which is given by
$F_{Y_n^{(n)}}(y) = [F(y)]^n$, where $F(\cdot)$ is the c.d.f. sampled from, is known. The hope is that we
will find an asymptotic distribution which does not depend on the sampled
c.d.f. $F(\cdot)$. We recall that the central-limit theorem gave an asymptotic distribution
for $\bar X_n$ which did not depend on the sampled distribution even though
the exact distribution of $\bar X_n$ could be found.

In searching for the asymptotic distribution of $Y_n^{(n)}$, let us pattern our
development after what was done in deriving the asymptotic distribution of $\bar X_n$.
According to the law of large numbers, $\bar X_n$ has a degenerate limiting distribution;
that is, the limiting c.d.f. of $\bar X_n$ is the cumulative distribution that assigns all its
mass to the point $\mu$. Such a limiting distribution is not useful if one intends to
use the limiting distribution to approximate probabilities of events since it
assigns each event a probability of either 0 or 1. To circumvent such difficulty,
we first "centered" the values of $\bar X_n$ by subtracting $\mu$, and then we "inflated"
the values of $\bar X_n - \mu$ by multiplying them by $\sqrt{n}/\sigma$, and, consequently, we were
able to get a nondegenerate limiting distribution; that is, according to the central-limit
theorem, $\sqrt{n}(\bar X_n - \mu)/\sigma$ had a standard normal distribution as its limiting
distribution. A general procedure, when one is looking for a limiting distribution
of, say, $Z_n$, is to first "center" the $Z_n$ by subtracting a constant, say $a_n$, and
to then "scale" $Z_n - a_n$ by dividing by another constant, say $b_n$. In the case of
the central-limit theorem, $Z_n = \bar X_n$, $a_n \equiv \mu$, and $b_n = \sigma/\sqrt{n}$. In the case of
Theorem 14 above, $Z_n = Y_{np_n}^{(n)}$, $a_n \equiv \xi_p$, and $b_n = \sqrt{p(1-p)/n[f(\xi_p)]^2}$. For both
of these two cases the sequence of constants $\{a_n\}$ did not depend on n. In the
case at hand, namely when $Z_n = Y_n^{(n)}$, the sequence of constants $\{a_n\}$ is likely to
depend on n since $Y_n^{(n)}$ tends to increase with n. Let us look at a couple of
examples.


EXAMPLE 8 Consider sampling from the logistic distribution; that is,
$F(x) = (1 + e^{-x})^{-1}$. Find the limiting distribution of $(Y_n^{(n)} - a_n)/b_n$.
There are two problems: First, what should we take the sequences of
constants $\{a_n\}$ and $\{b_n\}$ to be? And, second, what is the limiting
distribution of $(Y_n^{(n)} - a_n)/b_n$ for the selected constants $\{a_n\}$ and $\{b_n\}$?
It seems reasonable that the "centering" constants $\{a_n\}$ should be close to
$\mathcal{E}[Y_n^{(n)}]$; so we seek an approximation to $\mathcal{E}[Y_n^{(n)}]$. Now $F(X_1), \ldots, F(X_n)$
is a random sample from the uniform distribution over (0, 1); hence
$F(Y_n^{(n)})$ is the largest of a sample of size $n$ from a uniform distribution
over (0, 1). That $\mathcal{E}[F(Y_n^{(n)})] = n/(n + 1)$ can then be routinely derived.
Now $F(\mathcal{E}[Y_n^{(n)}]) \approx \mathcal{E}[F(Y_n^{(n)})]$, or

$$
\begin{aligned}
F(\mathcal{E}[Y_n^{(n)}]) &= \{1 + \exp(-\mathcal{E}[Y_n^{(n)}])\}^{-1} \\
&= 1 - \{1 + \exp(\mathcal{E}[Y_n^{(n)}])\}^{-1} \\
&\approx 1 - (n + 1)^{-1} \\
&= \frac{n}{n + 1} \\
&= \mathcal{E}[F(Y_n^{(n)})],
\end{aligned}
$$

which implies that

$$ n \approx \exp(\mathcal{E}[Y_n^{(n)}]) $$

or that

$$ \mathcal{E}[Y_n^{(n)}] \approx \log_e n. $$

Finally, since $\mathcal{E}[Y_n^{(n)}] \approx \log n$ (from here on we use $\log n$ for $\log_e n$), a
reasonable choice for the sequence of "centering" constants $\{a_n\}$ seems
to be the sequence $\{\log n\}$. We are seeking

$$
\begin{aligned}
\lim_{n\to\infty} P\left[\frac{Y_n^{(n)} - a_n}{b_n} \le y\right]
&= \lim_{n\to\infty} P\left[\frac{Y_n^{(n)} - \log n}{b_n} \le y\right] \\
&= \lim_{n\to\infty} P[Y_n^{(n)} \le b_n y + \log n] \\
&= \lim_{n\to\infty} [F(b_n y + \log n)]^n \\
&= \lim_{n\to\infty} (1 + e^{-b_n y - \log n})^{-n} \\
&= \lim_{n\to\infty} (1 + (1/n)e^{-b_n y})^{-n} \\
&= \exp(-e^{-y}) \quad \text{for } b_n = 1.
\end{aligned}
$$

Hence, if $\{a_n\}$ and $\{b_n\}$ are selected so that $\{a_n\} = \{\log n\}$ and $\{b_n\} = \{1\}$,
respectively, then the limiting distribution of $(Y_n^{(n)} - a_n)/b_n = Y_n^{(n)} - \log n$
is $\exp(-e^{-y})$. ////
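
The convergence found in Example 8 can be checked numerically. The following is only an illustrative sketch, not part of the original development: it compares the exact c.d.f. $[F(y + \log n)]^n$ of the centered maximum with the limit $\exp(-e^{-y})$ over a grid of $y$ values. The function names and the grid are our own choices.

```python
import math

def logistic_cdf(x):
    """C.d.f. of the logistic distribution, F(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def exact_cdf_centered_max(y, n):
    """Exact P[Y_n^(n) - log n <= y] = [F(y + log n)]^n for a logistic sample."""
    return logistic_cdf(y + math.log(n)) ** n

def limiting_cdf(y):
    """Limiting c.d.f. exp(-e^{-y}) obtained in Example 8."""
    return math.exp(-math.exp(-y))

def max_abs_gap(n, grid=None):
    """Largest gap between the exact and limiting c.d.f. over a grid of y values."""
    if grid is None:
        grid = [k / 10.0 for k in range(-30, 61)]  # y from -3.0 to 6.0 (arbitrary choice)
    return max(abs(exact_cdf_centered_max(y, n) - limiting_cdf(y)) for y in grid)

if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, max_abs_gap(n))
```

On this grid the gap shrinks as $n$ grows, which is what the limit statement leads one to expect; the sketch says nothing about the rate of convergence for other sampled distributions.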


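EXAMPLE 9 Consider sampling from the exponential distribution so that
$F(x) = (1 - e^{-\lambda x})I_{(0,\infty)}(x)$. Again, let us find the limiting distribution of
$(Y_n^{(n)} - a_n)/b_n$. As in Example 8, $\mathcal{E}[F(Y_n^{(n)})] = n/(n + 1) = 1 - 1/(n + 1)$.
Now

$$ F(\mathcal{E}[Y_n^{(n)}]) = 1 - \exp(-\lambda\mathcal{E}[Y_n^{(n)}]), $$

and

$$ F(\mathcal{E}[Y_n^{(n)}]) \approx \mathcal{E}[F(Y_n^{(n)})]; $$

so

$$ \frac{1}{n + 1} \approx \exp\{-\lambda\mathcal{E}[Y_n^{(n)}]\}, $$

or

$$ \mathcal{E}[Y_n^{(n)}] \approx \frac{1}{\lambda}\log(n + 1) \approx \frac{1}{\lambda}\log n. $$

Hence, it seems reasonable to use $a_n = (1/\lambda)\log n$. Then

$$
\begin{aligned}
\lim_{n\to\infty} P\left[\frac{Y_n^{(n)} - (1/\lambda)\log n}{b_n} \le y\right]
&= \lim_{n\to\infty} [F(b_n y + (1/\lambda)\log n)]^n \\
&= \lim_{n\to\infty} [1 - (1/n)e^{-\lambda b_n y}]^n \\
&= \exp(-e^{-y}) \quad \text{for } b_n = \frac{1}{\lambda}.
\end{aligned}
$$

Hence the limiting distribution of $(Y_n^{(n)} - a_n)/b_n = [Y_n^{(n)} - (1/\lambda)\log n]/(1/\lambda)$
is $\exp(-e^{-y})$. We note that we obtained the same limiting distribution
here as in Example 8. Here we were sampling from an exponential
distribution, and there we were sampling from a logistic distribution.
////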
In each of the above two examples we were able to obtain the limiting
distribution of $(Y_n^{(n)} - a_n)/b_n$ by using the exact distribution of $Y_n^{(n)}$ and
ordinary algebraic manipulation. There are some rather powerful theoretical results
concerning extreme-value statistics that tell us, among other things, what limiting
distributions we can expect. We can only sketch such results here. The
interested reader is referred to Refs. 13, 30, and 35.

Theorem 15 Let $X_1, \ldots, X_n, \ldots$ be independent and identically
distributed random variables with c.d.f. $F(\cdot)$. If $(Y_n^{(n)} - a_n)/b_n$ has a limiting
distribution, then that limiting distribution must be one of the following
three types:

$$ G_1(y; \gamma) = e^{-y^{-\gamma}} I_{(0,\infty)}(y), \quad \text{where } \gamma > 0. $$

$$ G_2(y; \gamma) = e^{-|y|^{\gamma}} I_{(-\infty,0)}(y) + I_{[0,\infty)}(y), \quad \text{where } \gamma > 0. $$

$$ G_3(y) = \exp(-e^{-y}). $$ ////


Theorem 15 states what types of limiting distributions can be expected.
The following theorem gives conditions on the sampled $F(\cdot)$ that enable us to
determine which of the three types of limiting distributions corresponds to the
sampled $F(\cdot)$.


Theorem 16 Let $X_1, \ldots, X_n, \ldots$ be independent and identically
distributed random variables with c.d.f. $F(\cdot)$. Assume that $(Y_n^{(n)} - a_n)/b_n$
has a limiting distribution. The limiting distribution is:

(i) $G_1(\cdot\,; \gamma)$ if and only if

$$ \lim_{x\to\infty} \frac{1 - F(x)}{1 - F(\tau x)} = \tau^{\gamma} \quad \text{for every } \tau > 0. $$

(ii) $G_2(\cdot\,; \gamma)$ if and only if there exists an $x_0$ such that
$F(x_0) = 1$ and $F(x_0 - \varepsilon) < 1$ for every $\varepsilon > 0$,
and

$$ \lim_{0 < x \to 0} \frac{1 - F(x_0 - \tau x)}{1 - F(x_0 - x)} = \tau^{\gamma} \quad \text{for every } \tau > 0. $$

(iii) $G_3(\cdot)$ if and only if

$$ \lim_{n\to\infty} n[1 - F(\beta_n x + \alpha_n)] = e^{-x} \quad \text{for each } x, $$

where

$$ \alpha_n = \inf\left\{z: \frac{n - 1}{n} \le F(z)\right\} $$

and

$$ \beta_n = \inf\{z: 1 - (ne)^{-1} \le F(\alpha_n + z)\}. $$ ////

Note that if $F(\cdot)$ is strictly monotone and continuous, then $\alpha_n$ is given by
$F(\alpha_n) = (n - 1)/n$ or $\alpha_n = F^{-1}((n - 1)/n)$, and $\beta_n$ is given by
$F(\alpha_n + \beta_n) = 1 - (ne)^{-1}$, or
$\beta_n = F^{-1}(1 - (ne)^{-1}) - \alpha_n = F^{-1}(1 - \{ne\}^{-1}) - F^{-1}(\{n - 1\}/n)$.


EXAMPLE 10 Take $F(x) = (1 - e^{-\lambda x})I_{(0,\infty)}(x)$ as in Example 9. $\alpha_n$ is such
that $F(\alpha_n) = (n - 1)/n$ or $1 - e^{-\lambda\alpha_n} = 1 - 1/n$, which implies that
$\alpha_n = (1/\lambda)\log n$. $\beta_n$ is such that $F(\alpha_n + \beta_n) = 1 - (ne)^{-1}$ or
$1 - e^{-\lambda(\alpha_n + \beta_n)} = 1 - (ne)^{-1}$, or $\beta_n = 1/\lambda$.

$$ \lim_{n\to\infty} n[1 - F(\beta_n x + \alpha_n)] = \lim_{n\to\infty} n e^{-x - \log n} = e^{-x} \quad \text{for each } x; $$

so, as we saw in Example 9, the exponential distribution has $G_3(\cdot)$ as its
corresponding limiting extreme-value distribution. ////
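
The constants of Example 10 can also be computed directly from the quantile function. The sketch below is illustrative only and assumes a strictly monotone, continuous $F(\cdot)$, so that $\alpha_n = F^{-1}((n-1)/n)$ and $\beta_n = F^{-1}(1 - (ne)^{-1}) - \alpha_n$. The helper names are hypothetical and the exponential rate is passed as `lam`.

```python
import math

def exp_quantile(p, lam):
    """Inverse c.d.f. of the exponential distribution with parameter lam."""
    return -math.log(1.0 - p) / lam

def norming_constants(n, lam):
    """Return (alpha_n, beta_n) as defined after Theorem 16, for the exponential case."""
    alpha = exp_quantile((n - 1) / n, lam)
    beta = exp_quantile(1.0 - 1.0 / (n * math.e), lam) - alpha
    return alpha, beta

def tail_condition(x, n, lam):
    """n[1 - F(beta_n x + alpha_n)], which condition (iii) requires to tend to e^{-x}."""
    alpha, beta = norming_constants(n, lam)
    z = beta * x + alpha
    survival = math.exp(-lam * z) if z > 0 else 1.0  # 1 - F(z) for the exponential
    return n * survival

if __name__ == "__main__":
    alpha, beta = norming_constants(50, 2.0)
    print(alpha, math.log(50) / 2.0)   # alpha_n against (1/lam) log n
    print(beta, 1.0 / 2.0)             # beta_n against 1/lam
    print(tail_condition(1.0, 50, 2.0), math.exp(-1.0))
```

For the exponential distribution the quantity $n[1 - F(\beta_n x + \alpha_n)]$ equals $e^{-x}$ whenever $\beta_n x + \alpha_n > 0$, not just in the limit, which is why the printed pairs agree up to rounding.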


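EXAMPLE 11 Take $F(x) = F(x; \gamma) = [1 - (1 - x)^{\gamma}]I_{(0,1)}(x) + I_{[1,\infty)}(x)$. Note
that for $x_0 = 1$, $F(x_0) = 1$ and $F(x_0 - \varepsilon) < 1$ for every $\varepsilon > 0$. Also,

$$ \lim_{0 < x \to 0} \frac{1 - F(x_0 - \tau x)}{1 - F(x_0 - x)} = \lim_{0 < x \to 0} \frac{(1 - x_0 + \tau x)^{\gamma}}{(1 - x_0 + x)^{\gamma}} = \tau^{\gamma}; $$

so $F(\cdot\,; \gamma)$ has $G_2(\cdot\,; \gamma)$ as its limiting extreme-value distribution.
////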


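EXAMPLE 12 Take $F(x) = F(x; \gamma)$, the c.d.f. of a t distribution with $\gamma$ degrees
of freedom.

$$ \lim_{x\to\infty} \frac{1 - F(x)}{1 - F(\tau x)} = \lim_{x\to\infty} \frac{f(x)}{\tau f(\tau x)} = \lim_{x\to\infty} \frac{[1 + (\tau x)^2/\gamma]^{(\gamma+1)/2}}{\tau(1 + x^2/\gamma)^{(\gamma+1)/2}} = \tau^{\gamma}; $$

so the t distribution with $\gamma$ degrees of freedom has $G_1(\cdot\,; \gamma)$ as its limiting
extreme-value distribution. ////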
Theorem 16 gives conditions on the sampled c.d.f. $F(\cdot)$ that enable us to
determine the proper limiting extreme-value distribution for $(Y_n^{(n)} - a_n)/b_n$. The
theorem does not tell us what the constants $\{a_n\}$ and $\{b_n\}$ should be. If,
however, the conditions for the third type are satisfied, then we have

$$ n[1 - F(\beta_n x + \alpha_n)] \to e^{-x} \quad \text{as } n \to \infty, $$

and

$$ P\left[\frac{Y_n^{(n)} - a_n}{b_n} \le x\right] \to \exp(-e^{-x}). $$

Now, $P[(Y_n^{(n)} - a_n)/b_n \le x] = [F(b_n x + a_n)]^n$; hence

$$ [F(b_n x + a_n)]^n \to \exp(-e^{-x}), $$

or

$$ n\log F(b_n x + a_n) \to -e^{-x}, $$

or (since $\log F \approx -(1 - F)$ when $F$ is near 1)

$$ n[1 - F(b_n x + a_n)] \to e^{-x}; $$

and we see that $a_n$ can be taken equal to $\alpha_n$ and $b_n = \beta_n$. Thus, for the third
type the constants $\{a_n\}$ and $\{b_n\}$ are actually determined by the condition for that
type. We shall see below that for certain practical applications it is possible to
estimate $\{a_n\}$ and $\{b_n\}$.

Since the types $G_1(\cdot\,; \gamma)$ and $G_2(\cdot\,; \gamma)$ both contain a parameter, it can be
surmised that the third type $G_3(\cdot)$ is more convenient than the other two in
applications. Also, $G_3(y) = \exp(-e^{-y})$ is the correct limiting extreme-value
distribution for a number of families of distributions. We saw that it was
correct for the logistic and exponential distributions in Examples 8 and 9; it is
also correct for the gamma and normal distributions. What is often done in
practice is to assume that the sampled distribution $F(\cdot)$ is such that $\exp(-e^{-y})$
is the proper limiting extreme-value distribution; one can do this without
assuming exactly which parametric family the sampled distribution $F(\cdot)$ belongs to.
One then knows that $P[(Y_n^{(n)} - a_n)/b_n \le y] \to \exp(-e^{-y})$ for every $y$ as $n \to \infty$.

Hence,

$$ P\left[\frac{Y_n^{(n)} - a_n}{b_n} \le y\right] \approx \exp(-e^{-y}) $$

for large fixed $n$. Or

$$ P[Y_n^{(n)} \le a_n + b_n y] \approx \exp(-e^{-y}), $$

or

$$ P[Y_n^{(n)} \le z] \approx \exp(-e^{-(z - a_n)/b_n}). $$
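
To see how the large-sample approximation $P[Y_n^{(n)} \le z] \approx \exp(-e^{-(z - a_n)/b_n})$ behaves for a fixed $n$, one can compare it with a simulation. The sketch below is illustrative only: it assumes exponential sampling with parameter `lam`, uses the constants $a_n = (1/\lambda)\log n$ and $b_n = 1/\lambda$ from Examples 9 and 10, and the number of replications and the seed are arbitrary choices.

```python
import math
import random

def extreme_value_approx(z, a_n, b_n):
    """Approximation exp(-e^{-(z - a_n)/b_n}) to P[Y_n^(n) <= z]."""
    return math.exp(-math.exp(-(z - a_n) / b_n))

def simulated_max_cdf(z, n, lam, reps=4000, seed=0):
    """Monte Carlo estimate of P[Y_n^(n) <= z] for samples of size n from an exponential."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        largest = max(rng.expovariate(lam) for _ in range(n))
        if largest <= z:
            hits += 1
    return hits / reps

if __name__ == "__main__":
    n, lam = 200, 1.5
    a_n, b_n = math.log(n) / lam, 1.0 / lam
    z = a_n + 0.5 * b_n
    exact = (1.0 - math.exp(-lam * z)) ** n
    print(exact, extreme_value_approx(z, a_n, b_n), simulated_max_cdf(z, n, lam))
```

The three printed numbers are the exact probability $[F(z)]^n$, the extreme-value approximation, and the simulated proportion; in practice $a_n$ and $b_n$ would have to be estimated rather than computed from a known $\lambda$.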
It is true that $a_n$ and $b_n$ are given in terms of the $(1 - 1/n)$th quantile and the
$(1 - 1/ne)$th quantile of the sampled distribution; however, for certain
applications they can be estimated, in which case we would have an approximate
distribution for $Y_n^{(n)}$, a distribution that is valid for a variety of different
distributions that could be sampled from. (One might note that in applications of
the central-limit theorem, which states that $\bar{X}_n$ is approximately distributed as
$N(\mu, \sigma^2/n)$, often $\mu$ and $\sigma^2$ are unknown and consequently they also have to be
estimated.) The preceding indicates how powerful the asymptotic extreme-value
theory can be. We have merely introduced the subject. For instance, we stated some
results for the asymptotic distribution of $Y_n^{(n)}$; one could state similar results
for $Y_1^{(n)}$, the smallest order statistic, or for other extreme order statistics.
The interested reader is referred to Refs. 13, 30, and 35.

5.4 Sample Cumulative Distribution Function

We have repeatedly stated in this chapter that our purpose in sampling from some
distribution was to make inferences about the sampled distribution, or
population, which was assumed to be at least partly unknown. One question that
might be posed is: Why not estimate the unknown distribution itself? The
answer is that we can estimate the unknown cumulative distribution function
using the sample, or empirical, cumulative distribution function, which is a
function of the order statistics.

Definition 13 Sample cumulative distribution function Let $X_1, X_2, \ldots, X_n$
denote a random sample from a cumulative distribution function $F(\cdot)$,
and let $Y_1 \le Y_2 \le \cdots \le Y_n$ denote the corresponding order statistics.
The sample cumulative distribution function, denoted by $F_n(x)$, is defined by
$F_n(x) = (1/n) \times$ (number of $Y_j$ less than or equal to $x$) or, equivalently,
by $F_n(x) = (1/n) \times$ (number of $X_i$ less than or equal to $x$). ////

For fixed $x$, $F_n(x)$ is a statistic since it is a function of the sample. (The
dependence of $F_n(x)$ on the sample may not be clear from the notation itself.)
We shall see that $F_n(x)$ has the same distribution as that of the sample mean of a
Bernoulli distribution.

Theorem 17 Let $F_n(x)$ denote the sample cumulative distribution function
of a random sample of size $n$ from $F(\cdot)$; then

$$ P\left[F_n(x) = \frac{k}{n}\right] = \binom{n}{k}[F(x)]^k[1 - F(x)]^{n-k}, \quad k = 0, 1, \ldots, n. \tag{43} $$

PROOF Let $Z_i = I_{(-\infty,x]}(X_i)$; then $Z_i$ has a Bernoulli distribution
with parameter $F(x)$. Hence, $\sum_{i=1}^{n} Z_i$, which is the number of $X_i$ less than
or equal to $x$, has a binomial distribution with parameters $n$ and $F(x)$. But
$F_n(x) = (1/n)\sum_{i=1}^{n} Z_i$. The result follows. ////
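
Definition 13 and Theorem 17 translate directly into a few lines of code. The sketch below is an illustration under our own naming (`sample_cdf` and `prob_sample_cdf_equals` are hypothetical helpers): the first builds $F_n(\cdot)$ from the order statistics, and the second evaluates the binomial probability in Eq. (43) for a given value of $F(x)$.

```python
import math
from bisect import bisect_right

def sample_cdf(sample):
    """Return F_n(.), where F_n(x) = (1/n) * (number of observations <= x)."""
    ordered = sorted(sample)          # the observed order statistics
    n = len(ordered)

    def F_n(x):
        # bisect_right counts how many ordered values are <= x
        return bisect_right(ordered, x) / n

    return F_n

def prob_sample_cdf_equals(k, n, F_x):
    """P[F_n(x) = k/n] from Eq. (43), given F_x = F(x)."""
    return math.comb(n, k) * F_x ** k * (1.0 - F_x) ** (n - k)

if __name__ == "__main__":
    F_n = sample_cdf([3.0, 1.0, 2.0, 2.0])
    print(F_n(0.5), F_n(2.0), F_n(10.0))
    n, F_x = 4, 0.3
    print(sum((k / n) * prob_sample_cdf_equals(k, n, F_x) for k in range(n + 1)))
```

The last line sums $(k/n)P[F_n(x) = k/n]$ over $k$ and should reproduce $F(x)$ up to rounding, in line with the remark that $F_n(x)$ is distributed like the sample mean of a Bernoulli sample.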

Much more could be said about the sample cumulative distribution function,
but we will wait until Chap. XI on nonparametric statistics to do so.


PROBLEMS 

1 (a) Give an example where the target population and the sampled population
are the same.

(b) Give an example where the target population and the sampled population
are not the same.

2 (a) A company manufactures transistors in three different plants A, B, and C
whose manufacturing methods are very similar. It is decided to inspect those
transistors that are manufactured in plant A since plant A is the largest plant
and statisticians are available there. In order to inspect a week's
production, 100 transistors will be selected at random and tested for defects.
Define the sampled population and target population.

(b) In part (a) above, it is decided to use the results in plant A to draw
conclusions about plants B and C. Define the target population.

3 ( a ) What is the probability that the two observations of a random sample of two 

from a population with a rectangular distribution over the unit interval will 
not differ by more than I? 

(6) What is the probability that the mean of a sample of two observations from a 
rectangular distribution over the unit interval will be between } and i? 

4 (a) Balls are drawn with replacement from an urn containing one white and two
black balls. Let $X = 0$ for a white ball and $X = 1$ for a black ball. For
samples $X_1, X_2, \ldots, X_9$ of size 9, what is the joint distribution of the
observations? The distribution of the sum of the observations?

(b) Referring to part (a) above, find the expected values of the sample mean and
sample variance.

5 Let $X_1, \ldots, X_n$ be a random sample from a distribution which has a finite fourth
moment. Define $\mu = \mathcal{E}[X_1]$, $\sigma^2 = \mathrm{var}[X_1]$, $\mu_3 = \mathcal{E}[(X_1 - \mu)^3]$,
$\mu_4 = \mathcal{E}[(X_1 - \mu)^4]$, $\bar{X} = (1/n)\sum_{i=1}^{n} X_i$, and
$S^2 = [1/(n - 1)]\sum_{i=1}^{n}(X_i - \bar{X})^2$.

(a) Does $S^2 = [1/2n(n - 1)]\sum_{i=1}^{n}\sum_{j=1}^{n}(X_i - X_j)^2$?

*(b) Find $\mathrm{var}[S^2]$.

*(c) Find $\mathrm{cov}[\bar{X}, S^2]$, and note that $\mathrm{cov}[\bar{X}, S^2] = 0$ if $\mu_3 = 0$.

Possible Hint: $\sum(X_i - \mu)^2 = \sum(X_i - \bar{X})^2 + (1/n)\sum\sum(X_i - \mu)(X_j - \mu)$.

6 *(<») For a random sample of size 2 from a population with a finite (2r)th moment 

find S[\f I and var [Af,] where hi. = (I/fl) 2 (A 1 , - T.)' 

(6) For a random sample of size n from a population with mean p and rth 
central moment fu, show that 

7 (a) Use the Chebyshev inequality to find how many times a coin must be tossed
in order that the probability will be at least .90 that $\bar{X}$ will lie between .4
and .6. (Assume that the coin is true.)

(b) How could one determine the number of tosses required in part (a) more
accurately, i.e., make the probability very nearly equal to .90? What is the
number of tosses?

8 If a population has $\sigma = 2$ and $\bar{X}$ is the mean of samples of size 100, find limits
between which $\bar{X} - \mu$ will lie with probability .90. Use both the Chebyshev
inequality and the central-limit theorem. Why do the two results differ?

9 Suppose that $\bar{X}_1$ and $\bar{X}_2$ are means of two samples of size $n$ from a population
with variance $\sigma^2$. Determine $n$ so that the probability will be about .01 that the
two sample means will differ by more than $\sigma$. (Consider $Y = \bar{X}_1 - \bar{X}_2$.)

10 Suppose that light bulbs made by a standard process have an average life of 2000
hours with a standard deviation of 250 hours, and suppose that it is considered
worthwhile to replace the process if the mean life can be increased by at least
10 percent. An engineer wishes to test a proposed new process, and he is willing
to assume that the standard deviation of the distribution of lives is about the
same as for the standard process. How large a sample should he examine if he
wishes the probability to be about .01 that he will fail to adopt the new process if
in fact it produces bulbs with a mean life of 2250 hours?

11 A research worker wishes to estimate the mean of a population using a sample
large enough that the probability will be .95 that the sample mean will not differ
from the population mean by more than 25 percent of the standard deviation.
How large a sample should he take?

12 A polling agency wishes to take a sample of voters in a given state large enough
that the probability is only .01 that they will find the proportion favoring a certain
candidate to be less than 50 percent when in fact it is 52 percent. How large a
sample should be taken?

13 A standard drug is known to be effective in about 80 percent of the cases in which
it is used to treat infections. A new drug has been found effective in 85 of the
first 100 cases tried. Is the superiority of the new drug well established? (If
the new drug were as equally effective as the old, what would be the probability
of obtaining 85 or more successes in a sample of 100?)

14 Find the third moment about the mean of the sample mean for samples of size $n$
from a Bernoulli population. Show that it approaches 0 as $n$ becomes large
(as it must if the normal approximation is to be valid).

15 (a) A bowl contains five chips numbered from 1 to 5. A sample of two drawn
without replacement from this finite population is said to be random if all
possible pairs of the five chips have an equal chance to be drawn. What is
the expected value of the sample mean? What is the variance of the sample
mean?

(b) Suppose that the two chips of part (a) were drawn with replacement; what
would be the variance of the sample mean? Why might one guess that this
variance would be larger than the one obtained before?

*(c) Generalize part (a) by considering $N$ chips and samples of size $n$. Show that
the variance of the sample mean is

$$ \frac{\sigma^2}{n}\,\frac{N - n}{N - 1}, $$

where $\sigma^2$ is the population variance; that is,

$$ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(i - \frac{N + 1}{2}\right)^2. $$

16 If $X_1, X_2, X_3$ are independent random variables and each has a uniform
distribution over (0, 1), derive the distribution of $(X_1 + X_2)/2$ and $(X_1 + X_2 + X_3)/3$.

17 If $X_1, \ldots, X_n$ is a random sample from $N(\mu, \sigma^2)$, find the mean and variance of


S- 



18 On the F distribution:

(a) Derive the variance of the F distribution. [See part (d).]

(b) If $X$ has an F distribution with $m$ and $n$ degrees of freedom, argue that $1/X$
has an F distribution with $n$ and $m$ degrees of freedom.

(c) If $X$ has an F distribution with $m$ and $n$ degrees of freedom, show that

$$ W = \frac{mX/n}{1 + mX/n} $$

has a beta distribution.

(d) Use the result of part (c) and the beta function to find the mean and variance
of the F distribution. [Find the first two moments of $mX/n = W/(1 - W)$.]

19 On the t distribution:

(a) Find the mean and variance of Student's t distribution. (Be careful about
existence.)

(b) Show that the density of a t-distributed random variable approaches the
standard normal density as the degrees of freedom increase. (Assume that
the "constant" part of the density does what it has to do.)

(c) If $X$ is t-distributed, show that $X^2$ is F-distributed.

(d) If $X$ is t-distributed with $k$ degrees of freedom, show that $1/(1 + X^2/k)$ has
a beta distribution.

20 Let Xu X x be a random sample from N( 0, 1) Using the results of Sec. 4 of 
Chap VI, answer the following 

(a) What is the distribution of (X, - X t )lV21 

(ft) What is the distribution of (Xi + *,)*/(*» - JQ*? 

(c) What is the distribution of (Jf, + 

GO What is the distribution of l/Z itZ = Xf/Xi ? 

21 Let X„ , X. be a random sample from N(0, 1) Define 

and 1 

Using the results of Sec 4, answer the following 
(a) What is the distribution of )(& + A”,.*)? 

(ft) What is the distribution of kX} -f (« — k)X.K k 2 
<c) What is the distribution of XUX\ ? 

GO What is the distribution of X,/X,l 

22 Let X„ , X, be a random sample from Nfa, a 1 ) Define 

*-&*• 

*— srkl.i.w-^0' 


and 
Using the results of Sec. 4, answer the following:

(a) What is the distribution of u~ 2 [(k - 1)S» 2 + Q,-k~ US 2 .,]? 

(b) What is the distribution of (i)(JT, 

(c) What is the distribution of o~ 2 (X, — /x) 2 ? 

(d) What is the distribution of Si/5 2 .,? 

(e) What is the distribution of (J— ft)/(S/Vn)? 

23 LetZi.Z* be a random sample of size 2 from N{ 0, 1) and Zi, X 2 a random sample 
of size 2 from N(l, I). Suppose the Z,'s are independent of the Z/s. Use the 
results of Sec. 4 to answer the following: 

(a) What is the distribution of Z-f-Z? 

C b ) What is the distribution of (Z, + Z 2 )/V [(Z 2 - A',) 2 + (Z, - Zi) 2 ]/2? 

(c) What is the distribution of [(X, - X 2 ) 2 + (Z, — Z 2 ) 2 + (Z, + Z 2 ) 2 ]/2 ? 

(</) What is the distribution of (Z 2 + Zt — 2 ) 2 /(Z 2 — Zj) 2 ? 

24 Let $Z_i$ be a random variable distributed $N(i, i^2)$, $i = 1, 2, 3$. Assume that the
random variables $Z_1$, $Z_2$, and $Z_3$ are independent. Using only the three random
variables $Z_1$, $Z_2$, and $Z_3$:

(a) Give an example of a statistic that has a chi-square distribution with three 
degrees of freedom. 

(b) Give an example of a statistic that has an F distribution with one and two 
degrees of freedom. 

(c) Give an example of a statistic that has t distribution with two degrees of 
freedom. 

25 Let Zi, Xi be a random sample of size 2 from the density 

fix) = =,(x). 

Use results on the chi-square and F distributions to give the distribution of 
Z,/Z 2 . 

26 Let $U_1, U_2$ be a random sample of size 2 from a uniform distribution over the
interval (0, 1). Let $Y_1$ and $Y_2$ be the corresponding order statistics.

(a) For $0 < y_2 < 1$, what is $f_{Y_1 \mid Y_2 = y_2}(y_1 \mid y_2)$, the conditional density of $Y_1$
given $Y_2 = y_2$?

(b) What is the distribution of $Y_2 - Y_1$?

27 If Zi, X 2 , X, are independently and normally distributed with the same 
mean but different variances a\, cr 2 , . . . , al and assuming that U = £(Z,/ct 2 )/£(1/ct 2 ) 
and V= S(Z, - U) 2 /af are independently distributed, show that U is normal and 
V has the chi-square distribution with n— 1 degrees of freedom. 

28 For three samples from normal populations (with variances a 2 ,, ai, and a]), the 
sample sizes being n,, rt 2 , and n it find the joint density of 

S? „ rr §2 

u= si and ^ = §V 

where the Si, Si, and Si are the sample variances. (Assume that the samples 
arc independent.) 
29 Let a sample of size $n_1$ from a normal population (with variance $\sigma_1^2$) have sample
variance $S_1^2$, and let a second sample of size $n_2$ from a second normal population
(with mean and variance a\) have mean X and sample variance S| . Find the 
joint density of 


V»T(^-/xi) . .. SI 

u ~— r t — and 


(Assume that the samples are independent ) 

30 For a random sample of size 2 from a normal density with mean 0 and variance 1,
find the distribution of the range.

31 (a) What is the probability that the larger of two random observations from any
continuous distribution will exceed the median?

(b) Generalize the result of part (a) to samples of size $n$.

32 Considering random samples of size $n$ from a population with density $f(x)$, what
is the expected value of the area under $f(x)$ to the left of the smallest sample
observation?

*33 Consider a random sample $X_1, \ldots, X_n$ from the uniform distribution over the
interval $(\mu - \sqrt{3}\sigma, \mu + \sqrt{3}\sigma)$. Let $Y_1 \le \cdots \le Y_n$ denote the corresponding
order statistics.

(a) Find the mean and variance of $Y_n - Y_1$.

(b) Find the mean and variance of $(Y_1 + Y_n)/2$.

(c) Find the mean and variance of $Y_{k+1}$ if $n = 2k + 1$, $k = 0, 1, \ldots$.

(d) Compare the variances of $\bar{X}_n$, $Y_{k+1}$, and $(Y_1 + Y_n)/2$.

Hint: It might be easier to solve the problem for $U_1, \ldots, U_n$, a random sample
from the uniform distribution over either (0, 1) or (-1, 1), and then make an
appropriate transformation.

34 Let $X_1, \ldots, X_n$ be a random sample from the density

$$ f(x; \alpha, \beta) = \frac{1}{2\beta}\exp[-|(x - \alpha)/\beta|], $$

where $-\infty < \alpha < \infty$ and $\beta > 0$. Compare the asymptotic distributions of the
sample mean and the sample median. In particular, compare the asymptotic
variances.

*35 Let $X_1, \ldots, X_n$ be a random sample from the cumulative distribution function
$F(x) = \{1 - \exp[-x/(1 - x)]\}I_{(0,1)}(x) + I_{[1,\infty)}(x)$. What is the limiting
distribution of $(Y_n^{(n)} - a_n)/b_n$, where $a_n = \log n/(1 + \log n)$ and
$b_n^{-1} = (\log n)(1 + \log n)$? What is the asymptotic distribution of $Y_n^{(n)}$?
36 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \theta e^{-\theta x}I_{(0,\infty)}(x)$, $\theta > 0$.

(a) Compare the asymptotic distribution of $\bar{X}_n$ with the asymptotic distribution
of the sample median.

(b) For your choice of $\{a_n\}$ and $\{b_n\}$, find a limiting distribution of $(Y_1^{(n)} - a_n)/b_n$.

(c) For your choice of $\{a_n\}$ and $\{b_n\}$, find a limiting distribution of $(Y_n^{(n)} - a_n)/b_n$.
PARAMETRIC POINT ESTIMATION


1 INTRODUCTION AND SUMMARY 

Chapter VI commenced with some general comments about inference. There,
it was indicated that a sample from the distribution of a population is useful
in making inferences about the population. Two important problems in
statistical inference are estimation and tests of hypotheses. One type of
estimation, namely point estimation, is to be the subject of this chapter.

The problem of estimation, as it shall be considered herein, is loosely
defined as follows: Assume that some characteristic of the elements in a
population can be represented by a random variable $X$ whose density is
$f_X(\cdot\,; \theta) = f(\cdot\,; \theta)$, where the form of the density is assumed known except
that it contains an unknown parameter $\theta$ (if $\theta$ were known, the density function
would be completely specified, and there would be no need to make inferences
about it). Further assume that the values $x_1, x_2, \ldots, x_n$ of a random sample
$X_1, X_2, \ldots, X_n$ from $f(\cdot\,; \theta)$ can be observed. On the basis of the observed sample
values $x_1, x_2, \ldots, x_n$ it is desired to estimate the value of the unknown
parameter $\theta$ or the value of some function, say $\tau(\theta)$, of the unknown parameter.
This estimation can be made in two ways. The first, called point estimation, is to
let the value of some statistic, say $\mathscr{t}(X_1, \ldots, X_n)$, represent, or estimate, the
unknown $\tau(\theta)$; such a statistic $\mathscr{t}(X_1, \ldots, X_n)$ is called a point estimator. The
second, called interval estimation, is to define two statistics, say
$\mathscr{t}_1(X_1, \ldots, X_n)$ and $\mathscr{t}_2(X_1, \ldots, X_n)$, where
$\mathscr{t}_1(X_1, \ldots, X_n) < \mathscr{t}_2(X_1, \ldots, X_n)$, so that
$(\mathscr{t}_1(X_1, \ldots, X_n), \mathscr{t}_2(X_1, \ldots, X_n))$ constitutes an interval for which the
probability can be determined that it contains the unknown $\tau(\theta)$. For example,
if $f(\cdot\,; \theta)$ is the normal density, that is,

$$ f(x; \theta) = f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right], $$

where the parameter $\theta$ is $(\mu, \sigma)$, and if it is desired to estimate the mean, that is,
$\tau(\theta) = \mu$, then the statistic $\bar{X} = (1/n)\sum_{i=1}^{n} X_i$ is a possible point estimator of
$\tau(\theta) = \mu$, and $(\bar{X} - 2\sqrt{S^2/n},\ \bar{X} + 2\sqrt{S^2/n})$ is a possible interval estimator
of $\tau(\theta) = \mu$. [Recall that $S^2 = [1/(n - 1)]\sum_{i=1}^{n}(X_i - \bar{X})^2$.] Point estimation
will be discussed in this chapter and interval estimation in the next.
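
As a small illustration of the two kinds of estimates just described, the following sketch computes the point estimate $\bar{x}$ and the interval $(\bar{x} - 2\sqrt{s^2/n},\ \bar{x} + 2\sqrt{s^2/n})$ from observed values. It is only a sketch: the function name and the data are made up for the example, and nothing is claimed here about the probability that such an interval contains $\mu$.

```python
import math

def point_and_interval(xs):
    """Return (x_bar, (lower, upper)) using x_bar -/+ 2*sqrt(s^2/n)."""
    n = len(xs)
    x_bar = sum(xs) / n                                  # value of the point estimator
    s2 = sum((x - x_bar) ** 2 for x in xs) / (n - 1)     # sample variance, divisor n - 1
    half_width = 2.0 * math.sqrt(s2 / n)
    return x_bar, (x_bar - half_width, x_bar + half_width)

if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0, 5.0]                 # hypothetical sample values
    print(point_and_interval(observed))
```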

Point estimation admits two problems: the first, to devise some means
of obtaining a statistic to use as an estimator; the second, to select criteria and
techniques to define and find a "best" estimator among many possible
estimators. Several methods of finding point estimators are introduced in Sec. 2.
One of these, and probably the most important, is the method of maximum
likelihood. In Sec. 3 several "optimum" properties that an estimator or sequence
of estimators may possess are defined. These include closeness, bias and variance,
efficiency, and consistency. The loss and risk functions, essential elements
in decision theory, are defined as possible tools in assessing the goodness of
estimators.

Section 4 is devoted to sufficiency, an important and useful concept in the
study of mathematical statistics that will also be utilized in succeeding chapters.
Unbiased estimation is considered in Sec. 5. The Cramér-Rao lower bound
for the variance of unbiased estimators is given, as well as the Rao-Blackwell
theorem concerning sufficient statistics. A brief look at invariant estimators is
presented in Sec. 6. Bayes estimation is considered in Sec. 7. A Bayes
estimator is defined either as the mean of the posterior or, from the
decision-theoretic viewpoint, as an estimator having smallest average risk. Some
results in the simultaneous estimation of several parameters are given in Sec. 8.
Included is the notion of ellipsoid of concentration of a vector of point
estimators and the Lehmann-Scheffé theorem. Section 9 is devoted to a brief
discussion of some optimum properties of maximum-likelihood estimators.
Frequent use of some of the distribution-theoretical results for statistics,
which were derived in earlier chapters, especially Chaps. V and VI, will be 
noted throughout this chapter. After all, estimators are statistics, and to study 
properties of estimators, it is desirable to look at their distributions. 


2 METHODS OF FINDING ESTIMATORS

Assume that $X_1, \ldots, X_n$ is a random sample from a density $f(\cdot\,; \theta)$, where the form
of the density is known but the parameter $\theta$ is unknown. Further assume that
$\theta$ is a vector of real numbers, say $\theta = (\theta_1, \ldots, \theta_k)$. (Often $k$ will be unity.)

We sometimes say that $\theta_1, \ldots, \theta_k$ are $k$ parameters. We will let $\Theta$, called the
parameter space, denote the set of possible values that the parameter $\theta$ can
assume. The object is to find statistics, functions of the observations $X_1, \ldots, X_n$,
to be used as estimators of the $\theta_j$, $j = 1, \ldots, k$. Or, more generally, our object
is to find statistics to be used as estimators of certain functions, say
$\tau_1(\theta), \ldots, \tau_r(\theta)$, of $\theta = (\theta_1, \ldots, \theta_k)$. A variety of methods of finding such
estimators has been proposed on more or less intuitive grounds. Several such
methods will be presented, along with examples, in this section. Another method,
the method of least squares, will be discussed in Chap. X.

An estimator can be defined as in Definition 1. 

Definition 1 Estimator Any statistic (known function of observable
random variables that is itself a random variable) whose values are used
to estimate $\tau(\theta)$, where $\tau(\cdot)$ is some function of the parameter $\theta$, is defined
to be an estimator of $\tau(\theta)$. ////

An estimator is always a statistic which is both a random variable and a
function. For instance, suppose $X_1, \ldots, X_n$ is a random sample from a density
$f(\cdot\,; \theta)$ and it is desired to estimate $\tau(\theta)$, where $\tau(\cdot)$ is some function of $\theta$. Let
$\mathscr{t}(X_1, \ldots, X_n)$ be an estimator of $\tau(\theta)$. The estimator $\mathscr{t}(X_1, \ldots, X_n)$ can be
thought of in two related ways: first, as the random variable, say $T$, where
$T = \mathscr{t}(X_1, \ldots, X_n)$, and, second, as the function $\mathscr{t}(\cdot, \ldots, \cdot)$. Naturally, one
needs to specify the function $\mathscr{t}(\cdot, \ldots, \cdot)$ before the random variable
$T = \mathscr{t}(X_1, \ldots, X_n)$ is defined. In all we have three types of tees: the capital
Latin $T$, which represents the random variable $\mathscr{t}(X_1, \ldots, X_n)$; the small script
$\mathscr{t}$, which represents the function $\mathscr{t}(\cdot, \ldots, \cdot)$; and the small Latin $t$, which
represents a value of $T$; that is, $t = \mathscr{t}(x_1, \ldots, x_n)$. Let us adopt the convention
of calling the statistic (or random variable) that is used as an estimator an
"estimator" and calling a value that the statistic takes on an "estimate." Thus the
word "estimator" stands for the function, and the word "estimate" stands for a
value of that function; for example, $\bar{X}_n = (1/n)\sum_{i=1}^{n} X_i$ is an estimator of a mean
$\mu$, and $\bar{x}_n$ is an estimate of $\mu$. Here $T$ is $\bar{X}_n$, $t$ is $\bar{x}_n$, and $\mathscr{t}(\cdot, \ldots, \cdot)$ is the
function defined by summing the arguments and then dividing by $n$.

Notation in estimation that has widespread usage is the following: $\hat{\theta}$ is
used to denote an estimate of $\theta$, and, more generally, $(\hat{\theta}_1, \ldots, \hat{\theta}_k)$ is a vector
that estimates the vector $(\theta_1, \ldots, \theta_k)$, where $\hat{\theta}_j$ estimates $\theta_j$, $j = 1, \ldots, k$. If
$\hat{\theta}$ is an estimate of $\theta$, then $\hat{\Theta}$ is the corresponding estimator of $\theta$; and if the
discussion requires that the function that defines both $\hat{\theta}$ and $\hat{\Theta}$ be specified, then
it can be denoted by a small script theta, that is, $\hat{\Theta} = \vartheta(X_1, \ldots, X_n)$.

When we speak of estimating $\theta$, we are speaking of estimating the fixed
yet unknown value that $\theta$ has. That is, we assume that the random sample
$X_1, \ldots, X_n$ came from the density $f(\cdot\,; \theta)$, where $\theta$ is unknown but fixed. Our
object is, after looking at the values of the random sample, to estimate the fixed
unknown $\theta$. And when we speak of estimating $\tau(\theta)$, we are speaking of
estimating the value $\tau(\theta)$ that the known function $\tau(\cdot)$ assumes for the unknown but
fixed $\theta$.



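2.1 Method of Moments

Let $f(\cdot\,; \theta_1, \ldots, \theta_k)$ be a density of a random variable $X$ which has $k$ parameters
$\theta_1, \ldots, \theta_k$. As before, let $\mu_r'$ denote the $r$th moment about 0; that is,
$\mu_r' = \mathcal{E}[X^r]$. In general $\mu_r'$ will be a known function of the $k$ parameters
$\theta_1, \ldots, \theta_k$. Denote this by writing $\mu_r' = \mu_r'(\theta_1, \ldots, \theta_k)$. Let $X_1, \ldots, X_n$ be a
random sample from the density $f(\cdot\,; \theta_1, \ldots, \theta_k)$, and, as before, let $M_j'$ be the
$j$th sample moment; that is,

$$ M_j' = \frac{1}{n}\sum_{i=1}^{n} X_i^j. $$

Form the $k$ equations

$$ M_j' = \mu_j'(\theta_1, \ldots, \theta_k), \quad j = 1, \ldots, k, \tag{1} $$

in the $k$ variables $\theta_1, \ldots, \theta_k$, and let $\hat{\Theta}_1, \ldots, \hat{\Theta}_k$ be their solution (we assume
that there is a unique solution). We say that the estimator $(\hat{\Theta}_1, \ldots, \hat{\Theta}_k)$,
where $\hat{\Theta}_j$ estimates $\theta_j$, is the estimator of $(\theta_1, \ldots, \theta_k)$ obtained by the method of
moments. The estimators were obtained by replacing population moments by
sample moments. Some examples follow.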
EXAMPLE 1 Let $X_1, \ldots, X_n$ be a random sample from a normal distribution
with mean $\mu$ and variance $\sigma^2$. Let $(\theta_1, \theta_2) = (\mu, \sigma)$. Estimate the
parameters $\mu$ and $\sigma$ by the method of moments. Recall that $\sigma^2 = \mu_2' - (\mu_1')^2$
and $\mu = \mu_1'$. The method-of-moments equations become

$$ M_1' = \mu_1' = \mu_1'(\mu, \sigma) = \mu $$

$$ M_2' = \mu_2' = \mu_2'(\mu, \sigma) = \sigma^2 + \mu^2, $$

and their solution is the following: The method-of-moments estimator
of $\mu$ is $M_1' = \bar{X}$, and the method-of-moments estimator of $\sigma$ is
$\sqrt{M_2' - \bar{X}^2} = \sqrt{(1/n)\sum X_i^2 - \bar{X}^2} = \sqrt{\sum(X_i - \bar{X})^2/n}$. Note that the
method-of-moments estimator of $\sigma$ given above is not $\sqrt{S^2}$. ////
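
The calculation in Example 1 can be sketched in code as follows. This is an illustration only: `normal_mom` and `sample_sd` are hypothetical helper names, and the data are made up. It also shows numerically that the method-of-moments estimate of $\sigma$ (divisor $n$) differs from $\sqrt{s^2}$ (divisor $n - 1$).

```python
import math

def normal_mom(xs):
    """Method-of-moments estimates of (mu, sigma) from the first two raw sample moments."""
    n = len(xs)
    m1 = sum(xs) / n                      # first sample moment M'_1
    m2 = sum(x * x for x in xs) / n       # second sample moment M'_2
    mu_hat = m1
    # max(..., 0.0) guards against a tiny negative value caused by rounding
    sigma_hat = math.sqrt(max(m2 - m1 * m1, 0.0))
    return mu_hat, sigma_hat

def sample_sd(xs):
    """sqrt(S^2) with S^2 = [1/(n-1)] * sum (x_i - x_bar)^2, for comparison."""
    n = len(xs)
    x_bar = sum(xs) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))

if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical observations
    print(normal_mom(data), sample_sd(data))
```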


EXAMPLE 2 Let $X_1, \ldots, X_n$ be a random sample from a Poisson distribution
with parameter $\lambda$. Estimate $\lambda$. There is only one parameter, hence
only one equation, which is

$$ M_1' = \mu_1' = \mu_1'(\lambda) = \lambda. $$

Hence the method-of-moments estimator of $\lambda$ is $M_1' = \bar{X}$, which says
estimate the population mean $\lambda$ with the sample mean $\bar{X}$. ////


EXAMPLE 3 Let $X_1, \ldots, X_n$ be a random sample from the negative
exponential density $f(x; \theta) = \theta e^{-\theta x} I_{(0, \infty)}(x)$.
Estimate $\theta$. The method-of-moments equation is

$$M'_1 = \mu'_1 = \mu'_1(\theta) = \frac{1}{\theta};$$

hence the method-of-moments estimator of $\theta$ is $1/M'_1 = 1/\bar X$. ////

EXAMPLE 4 Let $X_1, \ldots, X_n$ be a random sample from a uniform
distribution on $(\mu - \sqrt{3}\sigma, \mu + \sqrt{3}\sigma)$. Here the unknown
parameters are two, namely $\mu$ and $\sigma$, which are the population mean and
standard deviation. The method-of-moments equations are

$$M'_1 = \mu'_1 = \mu'_1(\mu, \sigma) = \mu$$

and

$$M'_2 = \mu'_2 = \mu'_2(\mu, \sigma) = \sigma^2 + \mu^2;$$

hence the method-of-moments estimators are $\bar X$ for $\mu$ and

$$\sqrt{\frac{1}{n}\sum X_i^2 - \bar X^2} = \sqrt{\frac{1}{n}\sum (X_i - \bar X)^2}$$

for $\sigma$.

We shall see later that there are better estimators of $\mu$ and $\sigma$ for
this distribution. ////


Method-of-moments estimators are not uniquely defined. The method-of-moments
equations in Eq. (1) are obtained by using the first $k$ raw moments. Central
moments (rather than raw moments) could also be used to obtain equations whose
solution would also produce estimators that would be labeled method-of-moments
estimators. Also, moments other than the first $k$ could be used to obtain
estimators that would be labeled method-of-moments estimators.

If, instead of estimating $(\theta_1, \ldots, \theta_k)$, method-of-moments
estimators of, say, $\tau_1(\theta_1, \ldots, \theta_k), \ldots,
\tau_r(\theta_1, \ldots, \theta_k)$ are desired, they can be obtained in several
ways. One way would be to first find method-of-moments estimates, say
$\hat\theta_1, \ldots, \hat\theta_k$, of $\theta_1, \ldots, \theta_k$ and then
use $\tau_j(\hat\theta_1, \ldots, \hat\theta_k)$ as an estimate of
$\tau_j(\theta_1, \ldots, \theta_k)$ for $j = 1, \ldots, r$. Another way would
be to form the equations

$$M'_j = \mu'_j(\tau_1, \ldots, \tau_r), \qquad j = 1, \ldots, r,$$

and solve them for $\tau_1, \ldots, \tau_r$. Estimators obtained using either
way are called method-of-moments estimators and may not be the same in both
cases.
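
The lack of uniqueness is easy to see for a Poisson sample, because both the
mean and the variance of a Poisson distribution equal $\lambda$. The sketch
below, with hypothetical function names and hypothetical counts, equates first
the first raw moment and then the second central moment to $\lambda$; both
results qualify as method-of-moments estimates, yet they differ.

```python
def poisson_mom_raw(sample):
    """Estimate lambda from the first raw sample moment (the sample mean)."""
    return sum(sample) / len(sample)


def poisson_mom_central(sample):
    """Estimate lambda from the second central sample moment (divisor n)."""
    n = len(sample)
    x_bar = sum(sample) / n
    return sum((x - x_bar) ** 2 for x in sample) / n


counts = [0, 1, 1, 2, 3, 5]          # hypothetical Poisson counts
lam_raw = poisson_mom_raw(counts)
lam_central = poisson_mom_central(counts)
print(lam_raw, lam_central)
```

With these counts the first estimate is 2 and the second is 16/6, so the choice
of moments matters in finite samples.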


2.2 Maximum Likelihood

To introduce the method of maximum likelihood, consider a very simple
estimation problem. Suppose that an urn contains a number of black and a number
of white balls, and suppose that it is known that the ratio of the numbers is
3/1 but that it is not known whether the black or the white balls are more
numerous. That is, the probability of drawing a black ball is either
$\tfrac14$ or $\tfrac34$. If $n$ balls are drawn with replacement from the urn,
the distribution of $X$, the number of black balls, is given by the binomial
distribution

$$f(x; p) = \binom{n}{x} p^x q^{n-x} \qquad \text{for } x = 0, 1, 2, \ldots, n,$$

where $q = 1 - p$ and $p$ is the probability of drawing a black ball. Here
$p = \tfrac14$ or $p = \tfrac34$.

We shall draw a sample of three balls, that is, $n = 3$, with replacement and
attempt to estimate the unknown parameter $p$ of the distribution. The
estimation problem is particularly simple in this case because we have only to
choose between the two numbers .25 and .75. Let us anticipate the results of
the drawing of the sample. The possible outcomes and their probabilities are
given below:


Outcome: $x$               0        1        2        3

$f(x; \tfrac34)$          1/64     9/64    27/64    27/64

$f(x; \tfrac14)$         27/64    27/64     9/64     1/64


In the present example, if we found $x = 0$ in a sample of 3, the estimate .25
for $p$ would be preferred over .75 because the probability $\tfrac{27}{64}$ is
greater than $\tfrac{1}{64}$, i.e., because a sample with $x = 0$ is more likely
(in the sense of having larger probability) to arise from a population with
$p = \tfrac14$ than from one with $p = \tfrac34$. And in general we should
estimate $p$ by .25 when $x = 0$ or 1 and by .75 when $x = 2$ or 3. The
estimator may be defined as

$$\hat p = \hat p(x) = \begin{cases} .25 & \text{for } x = 0, 1 \\ .75 & \text{for } x = 2, 3. \end{cases}$$

The estimator thus selects for every possible $x$ the value of $p$, say
$\hat p$, such that

$$f(x; \hat p) > f(x; p'),$$

where $p'$ is the alternative value of $p$.
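
The table and the resulting estimator can be reproduced with exact rational
arithmetic. This is only a sketch; the names `binomial_pmf` and `ml_choice` are
hypothetical helpers introduced for the illustration.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(x, n, p):
    """f(x; p): probability of x black balls in n draws with replacement."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n = 3
candidates = [Fraction(1, 4), Fraction(3, 4)]   # the two admissible values of p


def ml_choice(x):
    """Pick the candidate p under which the observed x is most probable."""
    return max(candidates, key=lambda p: binomial_pmf(x, n, p))


table = {p: [binomial_pmf(x, n, p) for x in range(n + 1)] for p in candidates}
estimates = [ml_choice(x) for x in range(n + 1)]
for p, row in table.items():
    print(p, [str(v) for v in row])
print([str(e) for e in estimates])
```

The two printed rows agree with the table, and the chosen values are 1/4 for
$x = 0, 1$ and 3/4 for $x = 2, 3$.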

More generally, if several alternative values of $p$ were possible, we might
reasonably proceed in the same manner. Thus if we found $x = 6$ in a sample
of 25 from a binomial population, we should substitute all possible values of
$p$ in the expression

$$f(6; p) = \binom{25}{6} p^6 (1 - p)^{19} \qquad \text{for } 0 \le p \le 1 \tag{2}$$

and choose as our estimate that value of $p$ which maximized $f(6; p)$. For the
given possible values of $p$ we should find our estimate to be $\tfrac{6}{25}$.
The position of its maximum value can be found by putting the derivative of the
function defined in Eq. (2) with respect to $p$ equal to 0 and solving the
resulting equation for $p$. Thus,

$$\frac{d}{dp} f(6; p) = \binom{25}{6} p^5 (1 - p)^{18} [6(1 - p) - 19p],$$

and on putting this equal to 0 and solving for $p$, we find that
$p = 0, 1, \tfrac{6}{25}$ are the roots. The first two roots give a minimum, and
so our estimate is therefore $\hat p = \tfrac{6}{25}$. This estimate has the
property that

$$f(6; \hat p) > f(6; p'),$$

where $p'$ is any other value of $p$ in the interval $0 \le p \le 1$.
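
The instruction to "substitute all possible values of $p$" can be imitated by a
grid search. The following sketch assumes a grid with step .001, which is an
arbitrary choice made here, and checks that the maximizing grid point agrees
with the root 6/25 found by differentiation.

```python
from math import comb


def likelihood(p, x=6, n=25):
    """f(x; p) of Eq. (2), regarded as a function of p for fixed x and n."""
    return comb(n, x) * p ** x * (1.0 - p) ** (n - x)


grid = [i / 1000 for i in range(1001)]   # candidate values 0, .001, ..., 1
p_hat = max(grid, key=likelihood)        # grid point with largest likelihood
print(p_hat)
```

The search returns .24, that is, 6/25, and the likelihood vanishes at the other
two roots $p = 0$ and $p = 1$.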

In order to define maximum-likelihood estimators, we shall first define the
likelihood function.

Definition 2 Likelihood function The likelihood function of $n$ random
variables $X_1, X_2, \ldots, X_n$ is defined to be the joint density of the $n$
random variables, say $f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \theta)$, which is
considered to be a function of $\theta$. In particular, if $X_1, \ldots, X_n$ is
a random sample from the density $f(x; \theta)$, then the likelihood function is
$f(x_1; \theta) f(x_2; \theta) \cdots f(x_n; \theta)$. ////

Notation To remind ourselves to think of the likelihood function as a
function of $\theta$, we shall use the notation $L(\theta; x_1, \ldots, x_n)$ or
$L(\cdot\,; x_1, \ldots, x_n)$ for the likelihood function. ////


The likelihood function $L(\theta; x_1, \ldots, x_n)$ gives the likelihood that
the random variables assume a particular value $x_1, x_2, \ldots, x_n$. The
likelihood is the value of a density function; so for discrete random variables
it is a probability. Suppose for a moment that $\theta$ is known; denote the
value by $\theta_0$. The particular value of the random variables which is "most
likely to occur" is that value $x'_1, x'_2, \ldots, x'_n$ such that
$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \theta_0)$ is a maximum. For example, for
simplicity let us assume that $n = 1$ and $X_1$ has the normal density with mean
6 and variance 1. Then the value of the random variable which is most likely to
occur is $X_1 = 6$. By "most likely to occur" we mean the value $x'_1$ of $X_1$
such that $\phi_{6,1}(x'_1) > \phi_{6,1}(x_1)$. Now let us suppose that the joint
density of $n$ random variables is
$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \theta)$, where $\theta$ is unknown. Let
the particular values which are observed be represented by
$x'_1, x'_2, \ldots, x'_n$. We want to know from which density is this
particular set of values most likely to have come. We want to know from which
density (what value of $\theta$) is the likelihood largest that the set
$x'_1, \ldots, x'_n$ was obtained. In other words, we want to find the value of
$\theta$ in the parameter space $\bar\Theta$, denoted by $\hat\theta$, which
maximizes the likelihood function $L(\theta; x'_1, \ldots, x'_n)$. The value
$\hat\theta$ which maximizes the likelihood function is, in general, a function
of $x_1, \ldots, x_n$, say $\hat\theta = \hat\vartheta(x_1, x_2, \ldots, x_n)$.
When this is the case, the random variable
$\hat\Theta = \hat\vartheta(X_1, X_2, \ldots, X_n)$ is called the
maximum-likelihood estimator of $\theta$. (We are assuming throughout that the
maximum of the likelihood function exists.) We shall now formalize the
definition of a maximum-likelihood estimator.

Definition 3 Maximum-likelihood estimator Let

$$L(\theta) = L(\theta; x_1, \ldots, x_n)$$

be the likelihood function for the random variables $X_1, X_2, \ldots, X_n$. If
$\hat\theta$ [where $\hat\theta = \hat\vartheta(x_1, x_2, \ldots, x_n)$ is a
function of the observations $x_1, \ldots, x_n$] is the value of $\theta$ in
$\bar\Theta$ which maximizes $L(\theta)$, then
$\hat\Theta = \hat\vartheta(X_1, X_2, \ldots, X_n)$ is the maximum-likelihood
estimator of $\theta$. $\hat\theta = \hat\vartheta(x_1, \ldots, x_n)$ is the
maximum-likelihood estimate of $\theta$ for the sample $x_1, \ldots, x_n$. ////

The most important cases which we shall consider are those in which
$X_1, X_2, \ldots, X_n$ is a random sample from some density $f(x; \theta)$, so
that the likelihood function is

$$L(\theta) = f(x_1; \theta) f(x_2; \theta) \cdots f(x_n; \theta).$$

Many likelihood functions satisfy regularity conditions; so the
maximum-likelihood estimator is the solution of the equation

$$\frac{dL(\theta)}{d\theta} = 0.$$

Also $L(\theta)$ and $\log L(\theta)$ have their maxima at the same value of
$\theta$, and it is sometimes easier to find the maximum of the logarithm of the
likelihood.

If the likelihood function contains $k$ parameters, that is, if

$$L(\theta_1, \theta_2, \ldots, \theta_k) = \prod_{i=1}^{n} f(x_i; \theta_1, \theta_2, \ldots, \theta_k),$$

then the maximum-likelihood estimators of the parameters
$\theta_1, \theta_2, \ldots, \theta_k$ are the random variables
$\hat\Theta_1 = \hat\vartheta_1(X_1, \ldots, X_n)$,
$\hat\Theta_2 = \hat\vartheta_2(X_1, \ldots, X_n)$, $\ldots$,
$\hat\Theta_k = \hat\vartheta_k(X_1, \ldots, X_n)$, where
$\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_k$ are the values in
$\bar\Theta$ which maximize $L(\theta_1, \theta_2, \ldots, \theta_k)$.

If certain regularity conditions are satisfied, the point where the likelihood
is a maximum is a solution of the $k$ equations

$$\frac{\partial L(\theta_1, \ldots, \theta_k)}{\partial \theta_1} = 0$$

$$\frac{\partial L(\theta_1, \ldots, \theta_k)}{\partial \theta_2} = 0$$

$$\vdots$$

$$\frac{\partial L(\theta_1, \ldots, \theta_k)}{\partial \theta_k} = 0.$$

In this case it may also be easier to work with the logarithm of the likelihood.
We shall illustrate these definitions with some examples.

EXAMPLE 5 Suppose that a random sample of size $n$ is drawn from the
Bernoulli distribution

$$f(x; p) = p^x q^{1-x} I_{\{0,1\}}(x), \qquad 0 \le p \le 1 \text{ and } q = 1 - p.$$

The sample values $x_1, x_2, \ldots, x_n$ will be a sequence of 0s and 1s, and
the likelihood function is

$$L(p) = \prod_{i=1}^{n} p^{x_i} q^{1 - x_i} = p^{\sum x_i} q^{n - \sum x_i},$$

and if we let

$$y = \sum x_i,$$

we obtain*

$$\log L(p) = y \log p + (n - y) \log q$$

and

$$\frac{d \log L(p)}{dp} = \frac{y}{p} - \frac{n - y}{q},$$

remembering that $q = 1 - p$. On putting this last expression equal to 0
and solving for $p$, we find the estimate

$$\hat p = \frac{y}{n} = \frac{1}{n} \sum x_i = \bar x, \tag{3}$$

which is intuitively what the estimate for this parameter should be. It is
also a method-of-moments estimate. For $n = 3$, let us sketch the likelihood
function. Note that the likelihood function depends on the $x_i$'s only through
$\sum x_i$; thus the likelihood function can be represented by the following
four curves:

$$L_0 = L(p; \textstyle\sum x_i = 0) = (1 - p)^3$$

$$L_1 = L(p; \textstyle\sum x_i = 1) = p(1 - p)^2$$

$$L_2 = L(p; \textstyle\sum x_i = 2) = p^2 (1 - p)$$

$$L_3 = L(p; \textstyle\sum x_i = 3) = p^3,$$

which are sketched in Fig. 1. Note that the point where the maximum of each of
the curves takes place for $0 \le p \le 1$ is the same as that given in Eq. (3)
when $n = 3$. ////

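*Recall that $\log x$ means $\log_e x$.

FIGURE 1 [vertical axis: $L(p)$]

EXAMPLE 6 A random sample of size $n$ from the normal distribution has
the density

$$\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(1/2\sigma^2)(x_i - \mu)^2}
= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[-\frac{1}{2\sigma^2} \sum (x_i - \mu)^2\right].$$

The logarithm of the likelihood function is

$$L^* = -\frac{n}{2} \log 2\pi - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum (x_i - \mu)^2,$$

where $\sigma > 0$ and $-\infty < \mu < \infty$.

To find the location of its maximum, we compute

$$\frac{\partial L^*}{\partial \mu} = \frac{1}{\sigma^2} \sum (x_i - \mu)$$

and

$$\frac{\partial L^*}{\partial \sigma^2} = -\frac{n}{2} \frac{1}{\sigma^2} + \frac{1}{2\sigma^4} \sum (x_i - \mu)^2,$$

and on putting these derivatives equal to 0 and solving the resulting
equations for $\mu$ and $\sigma^2$, we find the estimates

$$\hat\mu = \frac{1}{n} \sum x_i = \bar x \tag{4}$$

and

$$\hat\sigma^2 = \frac{1}{n} \sum (x_i - \bar x)^2, \tag{5}$$

which turn out to be the sample moments corresponding to $\mu$ and $\sigma^2$.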

////

EXAMPLE 7 Let the random variable $X$ have a uniform density given by

$$f(x; \theta) = I_{[\theta - \frac12, \theta + \frac12]}(x),$$

where $-\infty < \theta < \infty$; that is, $\bar\Theta$ = real line. The
likelihood function for a sample of size $n$ is

$$L(\theta; x_1, \ldots, x_n) = \prod_{i=1}^{n} I_{[\theta - \frac12, \theta + \frac12]}(x_i)
= I_{[y_n - \frac12, y_1 + \frac12]}(\theta), \tag{6}$$

where $y_1$ is the smallest of the observations and $y_n$ is the largest. The
last equality in Eq. (6) follows since
$\prod_{i=1}^{n} I_{[\theta - \frac12, \theta + \frac12]}(x_i)$ is unity if and
only if all $x_1, \ldots, x_n$ are in the interval
$[\theta - \tfrac12, \theta + \tfrac12]$, which is true if and only if
$\theta - \tfrac12 \le y_1$ and $y_n \le \theta + \tfrac12$, which is true if and
only if $y_n - \tfrac12 \le \theta \le y_1 + \tfrac12$. We see that the
likelihood function is either 1 (for
$y_n - \tfrac12 \le \theta \le y_1 + \tfrac12$) or 0 (otherwise); hence any
statistic with value $\hat\theta$ satisfying
$y_n - \tfrac12 \le \hat\theta \le y_1 + \tfrac12$ is a maximum-likelihood
estimate. Examples are $y_n - \tfrac12$, $y_1 + \tfrac12$, and
$\tfrac12 (y_1 + y_n)$. This latter is the midpoint between $y_n - \tfrac12$ and
$y_1 + \tfrac12$, or the midpoint between $y_1$ and $y_n$, the smallest and
largest observations. ////
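
The flat likelihood of Example 7 can be checked directly. In the sketch below
the function names and the four observations are hypothetical; the observations
were chosen to be exactly representable in binary so that the endpoint
comparisons are not disturbed by rounding.

```python
def likelihood(theta, sample):
    """L(theta) for the uniform density on [theta - 1/2, theta + 1/2]."""
    inside = all(theta - 0.5 <= x <= theta + 0.5 for x in sample)
    return 1.0 if inside else 0.0


def ml_interval(sample):
    """Every theta in [y_n - 1/2, y_1 + 1/2] maximizes the likelihood."""
    return max(sample) - 0.5, min(sample) + 0.5


sample = [4.75, 5.25, 5.0, 4.875]                # hypothetical observations
low, high = ml_interval(sample)
midrange = (min(sample) + max(sample)) / 2.0     # one of many ML estimates
print(low, high, midrange)
```

Any value between the two endpoints, including the midrange, gives likelihood
1, while values outside the interval give likelihood 0.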


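EXAMPLE 8 Let the random variable $X$ have a uniform distribution with
density given by

$$f(x; \theta) = f(x; \mu, \sigma) = \frac{1}{2\sqrt{3}\,\sigma} I_{[\mu - \sqrt{3}\sigma,\, \mu + \sqrt{3}\sigma]}(x),$$

where $-\infty < \mu < \infty$ and $\sigma > 0$. (Recall Example 4.) Here the
likelihood function for a sample of size $n$ is

$$L(\mu, \sigma; x_1, \ldots, x_n) = \left(\frac{1}{2\sqrt{3}\,\sigma}\right)^n \prod_{i=1}^{n} I_{[\mu - \sqrt{3}\sigma,\, \mu + \sqrt{3}\sigma]}(x_i)$$

$$= \left(\frac{1}{2\sqrt{3}\,\sigma}\right)^n I_{[\mu - \sqrt{3}\sigma,\, y_n]}(y_1)\, I_{[y_1,\, \mu + \sqrt{3}\sigma]}(y_n)$$

$$= \left(\frac{1}{2\sqrt{3}\,\sigma}\right)^n I_{[(\mu - y_1)/\sqrt{3},\, \infty)}(\sigma)\, I_{[(y_n - \mu)/\sqrt{3},\, \infty)}(\sigma),$$

where $y_1$ is the smallest of the observations and $y_n$ is the largest. The
likelihood function is $(2\sqrt{3}\,\sigma)^{-n}$ in the shaded area of Fig. 2
and 0 elsewhere. $(2\sqrt{3}\,\sigma)^{-n}$ within the shaded area is clearly a
maximum when $\sigma$ is smallest, which is at the intersection of the lines
$\mu - \sqrt{3}\sigma = y_1$ and $\mu + \sqrt{3}\sigma = y_n$. Hence the
maximum-likelihood estimates of $\mu$ and $\sigma$ are

$$\hat\mu = \tfrac12 (y_1 + y_n) \tag{7}$$

and

$$\hat\sigma = \frac{1}{2\sqrt{3}} (y_n - y_1), \tag{8}$$

which are quite different from the method-of-moments estimates given in
Example 4. ////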
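
To see how different the two sets of estimates can be, the following sketch
evaluates Eqs. (7) and (8) next to the method-of-moments estimates of Example 4
on the same data. The function names and the four observations are
hypothetical.

```python
import math


def uniform_mle(sample):
    """Eqs. (7) and (8): estimates built from the extreme observations."""
    y1, yn = min(sample), max(sample)
    return (y1 + yn) / 2.0, (yn - y1) / (2.0 * math.sqrt(3.0))


def uniform_mom(sample):
    """Example 4: sample mean and root of the second central sample moment."""
    n = len(sample)
    x_bar = sum(sample) / n
    return x_bar, math.sqrt(sum((x - x_bar) ** 2 for x in sample) / n)


sample = [1.0, 2.0, 2.0, 7.0]     # hypothetical observations
mle = uniform_mle(sample)
mom = uniform_mom(sample)
print(mle, mom)
```

For this sample the maximum-likelihood estimates are 4 and $\sqrt{3}$, whereas
the method-of-moments estimates are 3 and $\sqrt{5.5}$.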


The above four examples are sufficient to illustrate the application of the 
method of maximum likelihood. The last two show that one must not always 
rely on the differentiation process to locate the maximum. 

The function $L(\theta)$ may, for example, be represented by the curve in
Fig. 3, where the actual maximum is at $\hat\theta$, but the derivative set
equal to 0 would locate $\theta'$ as the maximum. One must also remember that
the equation $\partial L / \partial \theta = 0$ locates minima as well as
maxima, and hence one must avoid using a root of the equation which actually
locates a minimum.

We shall see in later sections (especially Sec. 9 of this chapter) that the 
maximum-likelihood estimator has some desirable optimum properties other 
than the intuitively appealing property that it maximizes the likelihood function. 
In addition, the maximum-likelihood estimators possess a property which is 
sometimes called the invariance property of maximum-likelihood estimators . A 
little reflection on the meaning of a single-valued inverse will convince one of 
the validity of the following theorem. 






Theorem 1 Invariance property of maximum-likelihood estimators Let
$\hat\Theta = \hat\vartheta(X_1, X_2, \ldots, X_n)$ be the maximum-likelihood
estimator of $\theta$ in the density $f(x; \theta)$, where $\theta$ is assumed
unidimensional. If $\tau(\cdot)$ is a function with a single-valued inverse,
then the maximum-likelihood estimator of $\tau(\theta)$ is $\tau(\hat\Theta)$.
////

For example, in the normal density with $\mu_0$ known, the maximum-likelihood
estimator of $\sigma^2$ is

$$\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu_0)^2.$$

By the invariance property of maximum-likelihood estimators, the
maximum-likelihood estimator of $\sigma$ is

$$\sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu_0)^2}.$$

Similarly, the maximum-likelihood estimator of, say, $\log \sigma^2$ is

$$\log\left[\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu_0)^2\right].$$

The invariance property of maximum-likelihood estimators that is exhibited
in Theorem 1 above can and should be extended. Following Zehna [43], we extend
in two directions: First, $\theta$ will be taken as $k$-dimensional rather
than unidimensional, and second, the assumption that $\tau(\cdot)$ has a
single-valued inverse will be removed. It can be noted that such extension is
necessary by considering two simple examples. As a first example, suppose an
estimate of the variance, namely $\theta(1 - \theta)$, of a Bernoulli
distribution is desired. Example 5 gives the maximum-likelihood estimate of
$\theta$ to be $\bar x$, but since $\theta(1 - \theta)$ is not a one-to-one
function of $\theta$, Theorem 1 does not give the maximum-likelihood estimator
of $\theta(1 - \theta)$. Theorem 2 below will give such an estimate, and it will
be $\bar x (1 - \bar x)$. As a second example, consider sampling from a normal
distribution where both $\mu$ and $\sigma^2$ are unknown, and suppose an
estimate of $\mathcal{E}[X^2] = \mu^2 + \sigma^2$ is desired. Example 6 gives the
maximum-likelihood estimates of $\mu$ and $\sigma^2$, but $\mu^2 + \sigma^2$ is
not a one-to-one function of $\mu$ and $\sigma^2$, and so the maximum-likelihood
estimate of $\mu^2 + \sigma^2$ is not known. Such an estimate will be obtainable
from Theorem 2 below. It will be $\bar x^2 + (1/n) \sum (x_i - \bar x)^2$.

Let $\theta = (\theta_1, \ldots, \theta_k)$ be a $k$-dimensional parameter, and,
as before, let $\bar\Theta$ denote the parameter space. Suppose that the
maximum-likelihood estimate of
$\tau(\theta) = (\tau_1(\theta), \ldots, \tau_r(\theta))$, where
$1 \le r \le k$, is desired. Let $\mathcal{T}$ denote the range space of the
transformation $\tau(\cdot) = (\tau_1(\cdot), \ldots, \tau_r(\cdot))$.
$\mathcal{T}$ is an $r$-dimensional space. Define

$$M(\tau; x_1, \ldots, x_n) = \sup_{\{\theta:\, \tau(\theta) = \tau\}} L(\theta; x_1, \ldots, x_n).$$

$M(\cdot\,; x_1, \ldots, x_n)$ is called the likelihood function induced by
$\tau(\cdot)$.* When estimating $\theta$ we maximized the likelihood function
$L(\theta; x_1, \ldots, x_n)$ as a function of $\theta$ for fixed
$x_1, \ldots, x_n$; when estimating $\tau = \tau(\theta)$ we will maximize the
likelihood function induced by $\tau(\cdot)$, namely
$M(\tau; x_1, \ldots, x_n)$, as a function of $\tau$ for fixed
$x_1, \ldots, x_n$. Thus, the maximum-likelihood estimate of
$\tau = \tau(\theta)$, denoted by $\hat\tau$, is any value that maximizes the
induced likelihood function for fixed $x_1, \ldots, x_n$; that is, $\hat\tau$
is such that $M(\hat\tau; x_1, \ldots, x_n) \ge M(\tau; x_1, \ldots, x_n)$ for
all $\tau \in \mathcal{T}$. The invariance property of maximum-likelihood
estimation is given in the following theorem.


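Theorem 2 Let $\hat\Theta = (\hat\Theta_1, \ldots, \hat\Theta_k)$, where
$\hat\Theta_j = \hat\vartheta_j(X_1, \ldots, X_n)$, be a maximum-likelihood
estimator of $\theta = (\theta_1, \ldots, \theta_k)$ in the density
$f(\cdot\,; \theta_1, \ldots, \theta_k)$. If
$\tau(\theta) = (\tau_1(\theta), \ldots, \tau_r(\theta))$ for $1 \le r \le k$ is
a transformation of the parameter space $\bar\Theta$, then a maximum-likelihood
estimator of $\tau(\theta) = (\tau_1(\theta), \ldots, \tau_r(\theta))$ is
$\tau(\hat\Theta) = (\tau_1(\hat\Theta), \ldots, \tau_r(\hat\Theta))$. [Note
that $\tau_j(\theta) = \tau_j(\theta_1, \ldots, \theta_k)$; so the
maximum-likelihood estimator of $\tau_j(\theta_1, \ldots, \theta_k)$ is
$\tau_j(\hat\Theta_1, \ldots, \hat\Theta_k)$, $j = 1, \ldots, r$.]

PROOF Let $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_k)$ be a
maximum-likelihood estimate of $\theta = (\theta_1, \ldots, \theta_k)$. It
suffices to show that
$M(\tau(\hat\theta); x_1, \ldots, x_n) \ge M(\tau; x_1, \ldots, x_n)$ for any
$\tau \in \mathcal{T}$, which follows immediately from the inequality

$$M(\tau; x_1, \ldots, x_n) = \sup_{\{\theta:\, \tau(\theta) = \tau\}} L(\theta; x_1, \ldots, x_n)
\le \sup_{\theta \in \bar\Theta} L(\theta; x_1, \ldots, x_n)$$

$$= L(\hat\theta; x_1, \ldots, x_n)
= \sup_{\{\theta:\, \tau(\theta) = \tau(\hat\theta)\}} L(\theta; x_1, \ldots, x_n)
= M(\tau(\hat\theta); x_1, \ldots, x_n). \quad ////$$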
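
The induced likelihood can be explored numerically for the Bernoulli variance
$\tau(\theta) = \theta(1 - \theta)$, which is not one-to-one. The sketch below
is an illustration under stated assumptions: the data (3 ones in 10 trials),
the grid of step .0001, and the function names are all hypothetical. For each
value $v$ it takes the larger likelihood over the two solutions of
$\theta(1 - \theta) = v$ and then maximizes over $v$.

```python
import math


def likelihood(theta, y, n):
    """Bernoulli likelihood L(theta) for y ones among n observations."""
    return theta ** y * (1.0 - theta) ** (n - y)


def induced_likelihood(v, y, n):
    """M(v): largest L(theta) over the theta with theta * (1 - theta) = v."""
    root = math.sqrt(max(0.0, 1.0 - 4.0 * v))
    thetas = ((1.0 - root) / 2.0, (1.0 + root) / 2.0)   # both solutions
    return max(likelihood(t, y, n) for t in thetas)


y, n = 3, 10                                   # hypothetical data
x_bar = y / n
grid = [i / 10000 for i in range(2501)]        # v ranges over [0, 1/4]
v_hat = max(grid, key=lambda v: induced_likelihood(v, y, n))
print(v_hat, x_bar * (1.0 - x_bar))
```

The maximizing grid value is .21, which agrees with $\bar x (1 - \bar x)$ as
Theorem 2 asserts.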


*The notation "sup" is used here, and elsewhere in this book, as it is usually
used in mathematics. For those readers who are not acquainted with this
notation, not much is lost if "sup" is replaced by "max," where max is an
abbreviation for maximum.

It is precisely this property of invariance enjoyed by maximum-likelihood
estimators that allowed us in our discussion of maximum-likelihood estimation
to consider estimating $(\theta_1, \ldots, \theta_k)$ rather than the more
general $\tau_1(\theta_1, \ldots, \theta_k), \ldots,
\tau_r(\theta_1, \ldots, \theta_k)$.


EXAMPLE 9 In the normal density, let
$\theta = (\theta_1, \theta_2) = (\mu, \sigma^2)$. Suppose
$\tau(\theta) = \mu + z_q \sigma$, where $z_q$ is given by $\Phi(z_q) = q$.
$\tau(\theta)$ is the $q$th quantile. According to Theorem 2, the
maximum-likelihood estimator of $\tau(\theta)$ is

$$\bar X + z_q \sqrt{\frac{1}{n} \sum (X_i - \bar X)^2}. \quad ////$$
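
A short sketch of Example 9 follows. It relies on `statistics.NormalDist` from
the Python standard library (Python 3.8 or later) to obtain $z_q$; the function
name `normal_quantile_mle` and the data are hypothetical.

```python
import math
from statistics import NormalDist


def normal_quantile_mle(sample, q):
    """Plug the ML estimates of mu and sigma into mu + z_q * sigma."""
    n = len(sample)
    x_bar = sum(sample) / n
    sigma_hat = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / n)
    z_q = NormalDist().inv_cdf(q)          # solves Phi(z_q) = q
    return x_bar + z_q * sigma_hat


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical observations
median_hat = normal_quantile_mle(data, 0.5)
upper_hat = normal_quantile_mle(data, 0.975)
print(median_hat, upper_hat)
```

For these data $\bar x = 5$ and the maximum-likelihood estimate of $\sigma$ is
2, so the estimated median is 5 and the estimated .975 quantile is $5 + 2 z_{.975}$.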


2.3 Other Methods

There are several other methods of obtaining point estimators of parameters.
Among these are (i) the method of least squares, to be discussed in Chap. X,
(ii) the Bayes method, to be discussed later in this chapter, (iii) the
minimum-chi-square method, and (iv) the minimum-distance method. In this
subsection we will briefly consider the last two. Neither will be used again in
this book.

Minimum-chi-square method Let $X_1, \ldots, X_n$ be a random sample from a
density given by $f_X(x; \theta)$, and let $\mathcal{S}_1, \ldots, \mathcal{S}_k$
be a partition of the range of $X$. The probability that an observation falls in
cell $\mathcal{S}_j$, $j = 1, \ldots, k$, denoted by $p_j(\theta)$, can be found.
For instance, if $f_X(x; \theta)$ is the density function of a continuous random
variable, then
$p_j(\theta) = P[X \text{ falls in cell } \mathcal{S}_j] = \int_{\mathcal{S}_j} f_X(x; \theta)\,dx$.
Note that $\sum_{j=1}^{k} p_j(\theta) = 1$. Let the random variable $N_j$ denote
the number of $X_i$'s in the sample which fall in cell $\mathcal{S}_j$,
$j = 1, \ldots, k$; then $\sum_{j=1}^{k} N_j = n$, the sample size. Form the
following summation:

$$\chi^2 = \sum_{j=1}^{k} \frac{[n_j - n p_j(\theta)]^2}{n p_j(\theta)},$$

where $n_j$ is a value of $N_j$. The numerator of the $j$th term in the sum is
the square of the difference between the observed and the expected number of
observations falling in cell $\mathcal{S}_j$. The minimum-chi-square estimate
of $\theta$ is that $\hat\theta$ which minimizes $\chi^2$. It is that $\theta$
among all possible $\theta$'s which makes the expected number of observations
in cell $\mathcal{S}_j$ "nearest" the observed number. The minimum-chi-square
estimator depends on the partition $\mathcal{S}_1, \ldots, \mathcal{S}_k$
selected.

EXAMPLE 10 Let $X_1, \ldots, X_n$ be a random sample from a Bernoulli
distribution; that is, $f_X(x; \theta) = \theta^x (1 - \theta)^{1 - x}$ for
$x = 0, 1$. Take $N_j$ = the number of observations equal to $j$ for
$j = 0, 1$. Here the range of the observation $X$ is partitioned into the two
sets consisting of the numbers 0 and 1 respectively.

$$\chi^2 = \sum_{j=0}^{1} \frac{[n_j - n p_j(\theta)]^2}{n p_j(\theta)}
= \frac{[n_0 - n(1 - \theta)]^2}{n(1 - \theta)} + \frac{(n_1 - n\theta)^2}{n\theta}$$

$$= \frac{[n - n_1 - n(1 - \theta)]^2}{n(1 - \theta)} + \frac{(n_1 - n\theta)^2}{n\theta}
= \frac{(n_1 - n\theta)^2}{n} \cdot \frac{1}{\theta(1 - \theta)}.$$

The minimum of $\chi^2$ as a function of $\theta$ can be found by inspection by
noting that $\chi^2 = 0$ for $\theta = n_1/n$. Hence $\hat\theta = n_1/n$. For
this example there was only one choice for the partition
$\mathcal{S}_1, \ldots, \mathcal{S}_k$. The estimator found is the same as what
would be obtained by either the method of moments or maximum likelihood. ////
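
When the minimizing value cannot be found by inspection, $\chi^2$ can be
minimized numerically. The sketch below does this for the Bernoulli case of
Example 10 with a simple grid search; the cell counts, the grid step of .001,
and the function name are hypothetical choices made for the illustration.

```python
def chi_square(theta, counts):
    """Sum over the cells {0} and {1} of (observed - expected)^2 / expected."""
    n = sum(counts)
    probs = (1.0 - theta, theta)          # p_0(theta) and p_1(theta)
    return sum((n_j - n * p_j) ** 2 / (n * p_j)
               for n_j, p_j in zip(counts, probs))


counts = (13, 7)                               # hypothetical n_0 and n_1
grid = [i / 1000 for i in range(1, 1000)]      # theta strictly inside (0, 1)
theta_hat = min(grid, key=lambda th: chi_square(th, counts))
print(theta_hat)
```

The grid search returns .35, which is $n_1/n = 7/20$, in agreement with the
result obtained by inspection.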


Often it is difficult to locate that $\theta$ which minimizes $\chi^2$; hence,
the denominator $n p_j(\theta)$ is sometimes changed to $n_j$ (if $n_j = 0$,
unity is used), forming a modified
$\chi^2 = \sum_{j=1}^{k} [n_j - n p_j(\theta)]^2 / n_j$. The modified
minimum-chi-square estimate of $\theta$ is then that $\hat\theta$ which
minimizes the modified $\chi^2$.

Minimum-distance method Let $X_1, \ldots, X_n$ be a random sample from the
distribution given by the cumulative distribution function
$F_X(x; \theta) = F(x; \theta)$, and let $d(F, G)$ be a distance function that
measures how "far apart" two cumulative distribution functions $F$ and $G$ are.
An example of a distance function is $d(F, G) = \sup_x |F(x) - G(x)|$, which is
the largest vertical distance between $F$ and $G$. See Fig. 4.

The minimum-distance estimate of $\theta$ is that $\hat\theta$ among all
possible $\theta$ for which $d(F(x; \theta), F_n(x))$ is minimized, where
$F_n(x)$ is the sample cumulative distribution function. Thus, $\hat\theta$ is
chosen so that $F(x; \hat\theta)$ will be "closest" to $F_n(x)$, which is
desirable since we saw in Subsec. 5.4 of Chap. VI that for a fixed argument $x$
the sample cumulative distribution function has the same distribution as the
mean of a binomial distribution; hence, by the law of large numbers $F_n(x)$
"converges" to $F(x)$. The minimum-distance estimator might be intuitively
appealing, but it is almost always difficult to find since locating
$\hat\theta$ which minimizes $d(F(x; \theta), F_n(x))$ is seldom easy. The
following example is an exception.

EXAMPLE 11 Again let $X_1, \ldots, X_n$ be a random sample from a Bernoulli
distribution; then

$$F(x; \theta) = (1 - \theta) I_{[0, 1)}(x) + I_{[1, \infty)}(x),$$

where $0 \le \theta \le 1$. Let $n_j$ = the number of observations equal to
$j$; $j = 0, 1$. Then

$$F_n(x) = \frac{n_0}{n} I_{[0, 1)}(x) + I_{[1, \infty)}(x).$$

Now if the distance function $d(F, G) = \sup_x |F(x) - G(x)|$ is used, then
$d(F(x; \theta), F_n(x))$ is minimized if $1 - \theta$ is taken equal to
$n_0/n$, or $\theta = n_1/n = \sum x_i / n$. Hence $\hat\theta = \bar x$. ////

For a more thorough discussion of the minimum-chi-square method, see
Cramér [11] or Rao [17]. The minimum-distance method is discussed in
Wolfowitz [42].


3 PROPERTIES OF POINT ESTIMATORS

We presented several methods of obtaining point estimators in the preceding
section. All the methods were arrived at on a more or less intuitive basis. The
question that now arises is: Are some of many possible estimators better, in
some sense, than others? In this section we will define certain properties,
which an estimator may or may not possess, that will help us in deciding whether
one estimator is better than another.

3.1 Closeness

If we have a random sample $X_1, \ldots, X_n$ from a density, say
$f(x; \theta)$, which is known except for $\theta$, then a point estimator of
$\tau(\theta)$ is a statistic, say $t(X_1, \ldots, X_n)$, whose value is used as
an estimate of $\tau(\theta)$. We will assume here that $\tau(\theta)$ is a
real-valued (not a vector) function of the unknown parameter $\theta$. [Often
$\tau(\theta)$ will be $\theta$ itself.] Ideally, we would like the value of
$t(X_1, \ldots, X_n)$ to be the unknown $\tau(\theta)$, but this is not possible
except in trivial cases, one of which follows.


EXAMPLE 12 Assume that one can sample from a density given by

$$f(x; \theta) = I_{(\theta - \frac12,\, \theta + \frac12)}(x),$$

where $\theta$ is known to be an integer. That is, $\bar\Theta$, the parameter
space, consists of all integers. Consider estimating $\theta$ on the basis of a
single observation $x_1$. If $t(x_1)$ is assigned as its value the integer
nearest $x_1$, then the statistic or estimator $t(X_1)$ will always correctly
estimate $\theta$. In a sense, the problem posed in this example is really not
statistical since one knows the value of $\theta$ after taking one
observation. ////


Not being able to achieve the ultimate of always correctly estimating the
unknown $\tau(\theta)$, we look for an estimator $t(X_1, \ldots, X_n)$ that is
"close" to $\tau(\theta)$. There are several ways of defining "close."
$T = t(X_1, \ldots, X_n)$ is a statistic and hence has a distribution, or rather
a family of distributions, depending on what $\theta$ is. The distribution of
$T$ tells how the values $t$ of $T$ are distributed, and we would like to have
the values of $T$ distributed near $\tau(\theta)$; that is, we would like to
select $t(\cdot, \ldots, \cdot)$ so that the values of
$T = t(X_1, \ldots, X_n)$ are concentrated near $\tau(\theta)$. We saw that the
mean and variance of a distribution were, respectively, measures of location and
spread. So what we might require of an estimator is that it have its mean near
or equal to $\tau(\theta)$ and have small variance. These two notions are
explored in Subsec. 3.2 below and then again in Sec. 5.

Rather than resorting to characteristics of a distribution, such as its mean 
and variance, one can define what “concentration ” might mean in terms of the 
distribution itself. Two such definitions follow. 

Definition 4 More concentrated and most concentrated Let
$T = t(X_1, \ldots, X_n)$ and $T' = t'(X_1, \ldots, X_n)$ be two estimators of
$\tau(\theta)$. $T'$ is called a more concentrated estimator of $\tau(\theta)$
than $T$ if and only if
$P_\theta[\tau(\theta) - \lambda < T' \le \tau(\theta) + \lambda] \ge
P_\theta[\tau(\theta) - \lambda < T \le \tau(\theta) + \lambda]$ for all
$\lambda > 0$ and for each $\theta$ in $\bar\Theta$. An estimator
$T^* = t^*(X_1, \ldots, X_n)$ is called most concentrated if it is more
concentrated than any other estimator. ////

Remark The subscript $\theta$ on the probability symbol $P_\theta[\cdot]$ is
there to emphasize that, in general, such probability depends on $\theta$. For
instance, in $P_\theta[\tau(\theta) - \lambda < T \le \tau(\theta) + \lambda]$,
the event $\{\tau(\theta) - \lambda < T \le \tau(\theta) + \lambda\}$ is
described in terms of the random variable $T$, and, in general, the
distribution of $T$ is indexed by $\theta$. ////

We see from the definition that the property of most concentrated is
highly desirable (Pitman [41], in defense of his calling a most concentrated
estimator best, stated that such an estimator is "undeniably best");
unfortunately, most concentrated estimators seldom exist. There are just too
many possible estimators for any one of them to be most concentrated. What is
then sometimes done is to restrict the totality of possible estimators under
consideration by requiring that each estimator possess some other desirable
property and to look for a best or most concentrated estimator in this
restricted class. We will not pursue the problem of finding most concentrated
estimators, even within some restricted class, in this book.

Another criterion for comparing estimators is the following one.

Definition 5 Pitman-closer and Pitman-closest Let $T = t(X_1, \ldots, X_n)$
and $T' = t'(X_1, \ldots, X_n)$ be two estimators of $\tau(\theta)$. $T'$ is
called a Pitman-closer estimator of $\tau(\theta)$ than $T$ if and only if

$$P_\theta[\,|T' - \tau(\theta)| < |T - \tau(\theta)|\,] \ge \tfrac12 \qquad \text{for each } \theta \text{ in } \bar\Theta.$$

An estimator $T^*$ is called Pitman-closest if it is Pitman-closer than any
other estimator. ////

The property of Pitman-closest is, like the property of most concentrated,
desirable, yet rarely will there exist a Pitman-closest estimator. Both
Pitman-closer and more concentrated are intuitively attractive properties to be
used to compare estimators, yet they are not always useful. Given two estimators
$T$ and $T'$, one does not have to be more concentrated or Pitman-closer than
the other. What often happens is that one, say $T$, is Pitman-closer or more
concentrated for some $\theta$ in $\bar\Theta$, and the other $T'$ is
Pitman-closer or more concentrated for other $\theta$ in $\bar\Theta$; and since
$\theta$ is unknown, we cannot say which estimator is preferred. Since
Pitman-closest estimators rarely exist for applied problems, we will not devote
further study to the notion in this book; instead, we will consider other ways
of measuring the closeness of an estimator to $\tau(\theta)$.

Competing estimators can be compared by defining a measure of the closeness of
an estimate to the unknown $\tau(\theta)$. An estimator
$T' = t'(X_1, \ldots, X_n)$ of $\tau(\theta)$ will be judged better than an
estimator $T = t(X_1, \ldots, X_n)$ if the measure of the closeness of $T'$ to
$\tau(\theta)$ indicates that $T'$ is closer to $\tau(\theta)$ than $T$. Such
concepts of closeness will be discussed in Subsecs. 3.2 and 3.4.

In the above we were assuming that $n$, the sample size, was fixed. Still
another meaning can be affixed to "closeness" if one thinks in terms of
increasing sample size. It seems that a good estimator should do better when it
is based on a large sample than when it is based on a small sample. Consistency
and asymptotic efficiency are two properties that are defined in terms of
increasing sample size; they are considered in Subsec. 3.3. Properties of point
estimators that are defined for a fixed sample size are sometimes referred to as
small-sample properties, whereas properties that are defined for increasing
sample size are sometimes referred to as large-sample properties.

3.2 Mean-squared Error

A useful, though perhaps crude, measure of goodness or closeness of an
estimator $t(X_1, \ldots, X_n)$ of $\tau(\theta)$ is what is called the
mean-squared error of the estimator.

Definition 6 Mean-squared error Let $T = t(X_1, \ldots, X_n)$ be an
estimator of $\tau(\theta)$. $\mathcal{E}_\theta[[T - \tau(\theta)]^2]$ is
defined to be the mean-squared error of the estimator
$T = t(X_1, \ldots, X_n)$. ////

Notation Let $\mathrm{MSE}_t(\theta)$ denote the mean-squared error of the
estimator $T = t(X_1, \ldots, X_n)$ of $\tau(\theta)$. ////

Remark The subscript $\theta$ on the expectation symbol $\mathcal{E}_\theta$
indicates from which density in the family under consideration the sample came.
That is,

$$\mathcal{E}_\theta[[T - \tau(\theta)]^2]
= \mathcal{E}_\theta[[t(X_1, \ldots, X_n) - \tau(\theta)]^2]$$

$$= \int \cdots \int [t(x_1, \ldots, x_n) - \tau(\theta)]^2 f(x_1; \theta) \cdots f(x_n; \theta)\, dx_1 \cdots dx_n,$$

where $f(x; \theta)$ is the probability density function from which the random
sample was selected. ////

The name "mean-squared error" can be justified if one first thinks of the
difference $t - \tau(\theta)$, where $t$ is a value of $T$ used to estimate
$\tau(\theta)$, as the error made in estimating $\tau(\theta)$, and then
interprets the "mean" in "mean-squared error" as expected or average. To support
the contention that the mean-squared error of an estimator is a measure of
goodness, one merely notes that $\mathcal{E}_\theta[[T - \tau(\theta)]^2]$ is a
measure of the spread of $T$ values about $\tau(\theta)$, just as the variance
of a random variable is a measure of its spread about its mean. If we were to
compare estimators by looking at their respective mean-squared errors, naturally
we would prefer one with small or smallest mean-squared error. We could define
as best that estimator with smallest mean-squared error, but such estimators
rarely exist. In general, the mean-squared error of an estimator depends on
$\theta$.

For any two estimators $T_1 = t_1(X_1, \ldots, X_n)$ and
$T_2 = t_2(X_1, \ldots, X_n)$ of $\tau(\theta)$, their respective mean-squared
errors $\mathrm{MSE}_{t_1}(\theta)$ and $\mathrm{MSE}_{t_2}(\theta)$ as functions
of $\theta$ are likely to cross; so for some $\theta$, $t_1$ has smaller MSE,
and for others $t_2$ has smaller MSE. We would then have no basis for preferring
one of the estimators over the other. See Fig. 5.

FIGURE 5

The following example shows that except in very rare cases an estimator
with smallest mean-squared error will not exist.


EXAMPLE 13 Let X lt , X u be a random sample from the density f{x ff) 
where 0 is a real number, and consider estimating 0 itself, that is, t(0) = $ 
We seek an estimator, say T* ~ t*{X u , XX such that £ 

MS E/0) for every 6 and for any other estimator T = /(X it , Jf,) of 9 

Consider the family of estimators T So = / ea (X lt , X„) a 9 0 indexed by 
0 O for 0 q e C> For each 0 O belonging to 0, the estimator T ti ignores the 
observations and estimates 9 to be 8 0 Note that 

MSE„ a (0 o ) = ^[[/ t0 (X u .A'j-en 

= Wo-e) I ] = (0 o -0) J t 

so MSE /4 * o ( 0 o ) = 0, that is, the mean squared error of evaluated at 
0 *= 0 O is 0 Hence, if there is to exist an estimator T* - , Xj 

satisfying MSE < .(0) <, MS E/0) for every 0 and for any estimator 
/ MSE/.(0) s 0 {For any Q 0 , MSE,.(0 o )=O since MSE<.(0 o )£ 
MSE /# *(0 o ) = 0 ) In order for an estimator to have its mean 
squared error identically 0, it must ahvays estimate 9 correctly, which 
means that from the sample you must be able to identify the true parameter 
value. ////

One reason for being unable to find an estimator with uniformly smallest
mean-squared error is that the class of all possible estimators is too large;
it includes some estimators that are extremely prejudiced in favor of particular $\theta$.
For instance, in the example above $t_{\theta_0}(X_1, \ldots, X_n)$ is highly partial to $\theta_0$ since
it always estimates $\theta$ to be $\theta_0$. One could restrict the totality of estimators by
considering only estimators that satisfy some other property. One such
property is that of unbiasedness.
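The crossing of mean-squared-error curves described in Example 13 can be seen numerically. The sketch below is only an illustration under assumptions that are not in the text: the sampled density is taken to have mean $\theta$ and known variance 1, the sample size is 10, and the competitor to the constant estimator $t_{\theta_0}$ is the sample mean, whose mean-squared error is then $\sigma^2/n$. The function names are hypothetical.

```python
def mse_constant(theta0, theta):
    """MSE of the estimator that ignores the data and always returns theta0."""
    return (theta0 - theta) ** 2


def mse_sample_mean(sigma2, n):
    """MSE of the sample mean as an estimator of the mean (unbiased, variance sigma2 / n)."""
    return sigma2 / n


# Assumed setting for illustration: variance 1, n = 10, theta0 = 0.
sigma2, n, theta0 = 1.0, 10, 0.0
for theta in (0.0, 0.1, 0.3, 0.5, 1.0, 2.0):
    constant = mse_constant(theta0, theta)
    mean = mse_sample_mean(sigma2, n)
    winner = "constant" if constant < mean else "sample mean"
    print(f"theta={theta:4.1f}  MSE constant={constant:6.3f}  MSE mean={mean:6.3f}  smaller: {winner}")
```

Near $\theta_0$ the constant estimator has the smaller mean-squared error, and far from $\theta_0$ the sample mean does, so neither is uniformly better.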

Definition 7 Unbiased An estimator $T = t(X_1, \ldots, X_n)$ is defined to
be an unbiased estimator of $\tau(\theta)$ if and only if

$$\mathscr{E}_\theta[T] = \mathscr{E}_\theta[t(X_1, \ldots, X_n)] = \tau(\theta) \quad \text{for all } \theta \in \bar{\Theta}. \qquad ////$$

An estimator is unbiased if the mean of its distribution equals $\tau(\theta)$, the
function of the parameter being estimated. Consider again the estimator
$t_{\theta_0}(X_1, \ldots, X_n) \equiv \theta_0$ of the above example; $\mathscr{E}_\theta[t_{\theta_0}(X_1, \ldots, X_n)] = \mathscr{E}_\theta[\theta_0] =
\theta_0 \ne \theta$; so $t_{\theta_0}(X_1, \ldots, X_n)$ is not an unbiased estimator of $\theta$. If we restricted the
totality of estimators under consideration by considering only unbiased estimators,
we could hope to find an estimator with uniformly smallest mean-squared
error within the restricted class, that is, within the class of unbiased estimators.
The problem of finding an unbiased estimator with uniformly smallest mean-squared
error among all unbiased estimators is dealt with in Sec. 5 below.

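Remark

$$\mathrm{MSE}_t(\theta) = \mathrm{var}[T] + \{\tau(\theta) - \mathscr{E}_\theta[T]\}^2. \qquad (9)$$

So if $T$ is an unbiased estimator of $\tau(\theta)$, then $\mathrm{MSE}_t(\theta) = \mathrm{var}[T]$.

PROOF

$$\mathrm{MSE}_t(\theta) = \mathscr{E}_\theta[[T - \tau(\theta)]^2] = \mathscr{E}_\theta[((T - \mathscr{E}_\theta[T]) - \{\tau(\theta) - \mathscr{E}_\theta[T]\})^2]$$

$$= \mathscr{E}_\theta[(T - \mathscr{E}_\theta[T])^2] - 2\{\tau(\theta) - \mathscr{E}_\theta[T]\}\,\mathscr{E}_\theta[T - \mathscr{E}_\theta[T]] + \mathscr{E}_\theta[\{\tau(\theta) - \mathscr{E}_\theta[T]\}^2]$$

$$= \mathrm{var}[T] + \{\tau(\theta) - \mathscr{E}_\theta[T]\}^2,$$

since the middle term vanishes because $\mathscr{E}_\theta[T - \mathscr{E}_\theta[T]] = 0$. ////

The term $\tau(\theta) - \mathscr{E}_\theta[T]$ is called the bias of the estimator $T$ and can be
either positive, negative, or zero. The remark shows that the mean-squared
error is the sum of two nonnegative quantities; it also shows how the mean-squared
error, variance, and bias of an estimator are related.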
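The decomposition in Eq. (9) can be checked on simulated values of an estimator. The following is a minimal sketch, assuming normal sampling with mean 0 and variance 4, sample size 5, and the estimator $(1/n)\sum (X_i - \bar{X})^2$ of the variance; these choices and the helper name `mse_decomposition` are ours, not the text's. The same identity holds exactly for the empirical mean-squared error, variance, and bias of the simulated values.

```python
import random


def mse_decomposition(estimates, target):
    """Return (empirical MSE, empirical variance, empirical bias) of the estimates about target."""
    m = len(estimates)
    mean_est = sum(estimates) / m
    mse = sum((t - target) ** 2 for t in estimates) / m
    var = sum((t - mean_est) ** 2 for t in estimates) / m
    bias = target - mean_est  # sign convention of the text: tau(theta) - E[T]
    return mse, var, bias


random.seed(0)
n, sigma2, reps = 5, 4.0, 20000
estimates = []
for _ in range(reps):
    xs = [random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    xbar = sum(xs) / n
    estimates.append(sum((x - xbar) ** 2 for x in xs) / n)

mse, var, bias = mse_decomposition(estimates, sigma2)
print("MSE          :", mse)
print("var + bias^2 :", var + bias ** 2)
print("bias         :", bias)
```

The first two printed numbers agree up to rounding, and the bias is close to $\sigma^2/n$, which is positive because this estimator underestimates $\sigma^2$ on the average.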


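EXAMPLE 14 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \phi_{\mu, \sigma^2}(x)$.
Recall that the maximum-likelihood estimators of $\mu$ and $\sigma^2$ are, respectively,
$\bar{X}$ and $(1/n) \sum (X_i - \bar{X})^2$. (See Example 6.) Now $\mathscr{E}_\theta[\bar{X}] = \mu$; so
$\bar{X}$ is an unbiased estimator of $\mu$, and hence the mean-squared error of
$\bar{X}$ is $\mathscr{E}_\theta[(\bar{X} - \mu)^2] = \mathrm{var}[\bar{X}] = \sigma^2/n$. We know that $\mathscr{E}_\theta[S^2] = \sigma^2$; so

$$\mathscr{E}_\theta\Big[(1/n) \sum (X_i - \bar{X})^2\Big] = \mathscr{E}_\theta\Big[\frac{n-1}{n} \cdot \frac{1}{n-1} \sum (X_i - \bar{X})^2\Big]
= \frac{n-1}{n}\,\mathscr{E}_\theta[S^2] = \frac{n-1}{n}\,\sigma^2.$$

Hence the maximum-likelihood estimator of $\sigma^2$ is not unbiased. The
mean-squared error of $(1/n) \sum (X_i - \bar{X})^2$ is

$$\mathscr{E}_\theta\Big[\Big[(1/n) \sum (X_i - \bar{X})^2 - \sigma^2\Big]^2\Big]
= \mathrm{var}\Big[(1/n) \sum (X_i - \bar{X})^2\Big] + \Big\{\sigma^2 - \mathscr{E}_\theta\Big[(1/n) \sum (X_i - \bar{X})^2\Big]\Big\}^2$$

$$= \Big(\frac{n-1}{n}\Big)^2 \mathrm{var}[S^2] + \Big(\sigma^2 - \frac{n-1}{n}\,\sigma^2\Big)^2$$

$$= \Big(\frac{n-1}{n}\Big)^2 \frac{1}{n}\Big(\mu_4 - \frac{n-3}{n-1}\,\sigma^4\Big) + \frac{\sigma^4}{n^2},$$

using Eq. (10) of Theorem 2 in Chap. VI.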
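The closing formula of Example 14 is easy to evaluate. The sketch below codes the variance of $S^2$ as quoted from Eq. (10) of Chap. VI and the resulting mean-squared error of $(1/n)\sum (X_i - \bar{X})^2$. The function names are hypothetical, and the numerical check uses the standard fact that a normal density has fourth central moment $\mu_4 = 3\sigma^4$.

```python
def var_s2(n, mu4, sigma2):
    """Variance of S^2 = sum (X_i - Xbar)^2 / (n - 1): (1/n) * (mu4 - (n-3)/(n-1) * sigma^4)."""
    return (mu4 - (n - 3) / (n - 1) * sigma2 ** 2) / n


def mse_mle_variance(n, mu4, sigma2):
    """MSE of (1/n) * sum (X_i - Xbar)^2 as an estimator of sigma^2: variance plus squared bias."""
    shrink = (n - 1) / n
    variance = shrink ** 2 * var_s2(n, mu4, sigma2)
    bias = sigma2 - shrink * sigma2
    return variance + bias ** 2


# Normal sampling with sigma^2 = 1, so mu4 = 3.
n, sigma2 = 10, 1.0
mu4 = 3.0 * sigma2 ** 2
print("MSE of S^2 (unbiased, so equal to its variance):", var_s2(n, mu4, sigma2))
print("MSE of the maximum-likelihood estimator        :", mse_mle_variance(n, mu4, sigma2))
```

For normal sampling the two expressions reduce to $2\sigma^4/(n-1)$ and $(2n-1)\sigma^4/n^2$, so in this case the biased maximum-likelihood estimator has the smaller mean-squared error.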


Remark For the most part, in the remainder of this book we will take
the mean-squared error of an estimator as our standard in assessing the
goodness of an estimator. ////


3.3 Consistency and BAN

In the previous subsection we defined the mean-squared error of an estimator
and the property of unbiasedness. Both concepts were defined for a fixed
sample size. In this subsection we will define two concepts that are defined
for increasing sample size. In our notation for an estimator of $\tau(\theta)$, let us use
$T_n = t_n(X_1, \ldots, X_n)$, where the subscript $n$ of $t$ indicates sample size. Actually
we will be considering a sequence of estimators, say $T_1 = t_1(X_1)$, $T_2 = t_2(X_1, X_2)$,
$T_3 = t_3(X_1, X_2, X_3)$, ..., $T_n = t_n(X_1, \ldots, X_n)$, .... An obvious example is
$T_n = t_n(X_1, \ldots, X_n) = \bar{X}_n = (1/n) \sum X_i$. Ordinarily the functions $t_n$ in the
sequence will be the same kind of function for each $n$.

When considering a sequence of estimators, it seems that a good sequence
of estimators should be one for which the values of the estimators tend to get
closer to the quantity being estimated as the sample size increases. The following
definitions formalize this intuitively desirable notion of limiting closeness.

Definition 8 Mean-squared-error consistency Let $T_1, T_2, \ldots, T_n, \ldots$
be a sequence of estimators of $\tau(\theta)$, where $T_n = t_n(X_1, \ldots, X_n)$ is based
on a sample of size $n$. This sequence of estimators is defined to be a
mean-squared-error consistent sequence of estimators of $\tau(\theta)$ if and only if

$$\lim_{n \to \infty} \mathscr{E}_\theta[[T_n - \tau(\theta)]^2] = 0 \quad \text{for all } \theta \text{ in } \bar{\Theta}. \qquad ////$$

Remark Mean-squared-error consistency implies that both the bias
and the variance of $T_n$ approach 0 since $\mathscr{E}_\theta[[T_n - \tau(\theta)]^2] = \mathrm{var}[T_n]
+ \{\tau(\theta) - \mathscr{E}_\theta[T_n]\}^2$. ////


EXAMPLE 15 In sampling from any density having mean $\mu$ and variance
$\sigma^2$, let $\bar{X}_n = (1/n) \sum_{i=1}^{n} X_i$ be a sequence of estimators of $\mu$ and
$S_n^2 = [1/(n-1)] \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$ be a sequence of estimators of $\sigma^2$.
$\mathscr{E}[(\bar{X}_n - \mu)^2] = \mathrm{var}[\bar{X}_n] = \sigma^2/n \to 0$ as $n \to \infty$; hence the sequence
$\{\bar{X}_n\}$ is a mean-squared-error consistent sequence of estimators of $\mu$.

$$\mathscr{E}[(S_n^2 - \sigma^2)^2] = \mathrm{var}[S_n^2] = \frac{1}{n}\Big(\mu_4 - \frac{n-3}{n-1}\,\sigma^4\Big) \to 0$$

as $n \to \infty$, using Eq. (10) of Chap. VI; hence the sequence $\{S_n^2\}$ is a
mean-squared-error consistent sequence of estimators of $\sigma^2$. Note that if
$T_n = (1/n) \sum (X_i - \bar{X}_n)^2$, then the sequence $\{T_n\}$ is also a mean-squared-error
consistent sequence of estimators of $\sigma^2$. ////


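There is another, weaker notion of consistency, given in the following
definition.


Definition 9 Simple consistency Let $T_1, T_2, \ldots, T_n, \ldots$ be a sequence
of estimators of $\tau(\theta)$, where $T_n = t_n(X_1, \ldots, X_n)$. The sequence $\{T_n\}$ is
defined to be a simple (or weakly) consistent sequence of estimators of $\tau(\theta)$
if for every $\varepsilon > 0$ the following is satisfied:

$$\lim_{n \to \infty} P_\theta[\tau(\theta) - \varepsilon < T_n < \tau(\theta) + \varepsilon] = 1 \quad \text{for every } \theta \text{ in } \bar{\Theta}. \qquad ////$$


Remark If an estimator is a mean-squared-error consistent estimator,
it is also a simple consistent estimator, but not necessarily vice versa.


PROOF

$$P_\theta[\tau(\theta) - \varepsilon < T_n < \tau(\theta) + \varepsilon] = P_\theta[|T_n - \tau(\theta)| < \varepsilon]
= P_\theta[[T_n - \tau(\theta)]^2 < \varepsilon^2] \ge 1 - \frac{\mathscr{E}_\theta[[T_n - \tau(\theta)]^2]}{\varepsilon^2}$$

by the Chebyshev inequality. As $n$ approaches infinity, $\mathscr{E}_\theta[[T_n - \tau(\theta)]^2]$
approaches 0; hence $\lim_{n \to \infty} P_\theta[\tau(\theta) - \varepsilon < T_n < \tau(\theta) + \varepsilon] = 1$. ////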
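The Chebyshev bound used in the proof can be tabulated for the sample mean, whose mean-squared error is $\sigma^2/n$. This is only a sketch: the values $\sigma^2 = 1$ and $\varepsilon = 0.1$ and the function name are assumptions chosen for illustration, and the bound is floored at 0 because a probability cannot be negative.

```python
def chebyshev_lower_bound(mse, eps):
    """Lower bound 1 - MSE / eps^2 on P[|T_n - tau| < eps], floored at 0."""
    return max(0.0, 1.0 - mse / eps ** 2)


sigma2, eps = 1.0, 0.1
for n in (10, 100, 1000, 10000, 100000):
    bound = chebyshev_lower_bound(sigma2 / n, eps)
    print(f"n={n:6d}  P[|mean - mu| < {eps}] >= {bound:.4f}")
```

The bound is uninformative for small $n$ and tends to 1 as $n$ increases, which is the statement of simple consistency for this sequence.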

We close this subsection with one further large-sample definition.

Definition 10 Best asymptotically normal estimators (BAN estimators)
A sequence of estimators $T_1^*, \ldots, T_n^*, \ldots$ of $\tau(\theta)$ is defined to be best
asymptotically normal (BAN) if and only if the following four conditions
are satisfied:

(i) The distribution of $\sqrt{n}\,[T_n^* - \tau(\theta)]$ approaches the normal
distribution with mean 0 and variance $\sigma^{*2}(\theta)$ as $n$ approaches infinity.
(ii) For every $\varepsilon > 0$,

$$\lim_{n \to \infty} P_\theta[|T_n^* - \tau(\theta)| > \varepsilon] = 0 \quad \text{for each } \theta \text{ in } \bar{\Theta}.$$

(iii) Let $\{T_n\}$ be any other sequence of simple consistent estimators
for which the distribution of $\sqrt{n}\,[T_n - \tau(\theta)]$ approaches the normal
distribution with mean 0 and variance $\sigma^2(\theta)$.
(iv) $\sigma^2(\theta)$ is not less than $\sigma^{*2}(\theta)$ for all $\theta$ in any open interval. ////


Remark The abbreviation BAN is sometimes replaced by CANE,
standing for consistent asymptotically normal efficient. ////


The usefulness of this definition derives partially from theorems proving
the existence of BAN estimators and from the fact that ordinarily reasonable
estimators are asymptotically normally distributed.

It can be shown that for samples drawn from a normal density with
mean $\mu$ and variance $\sigma^2$ the sequence $T_n^* = (1/n) \sum_{i=1}^{n} X_i = \bar{X}_n$ for $n = 1, 2, \ldots$
is a BAN estimator of $\mu$. In fact the limiting distribution of $\sqrt{n}\,(\bar{X}_n - \mu)$ is
normal with mean 0 and variance $\sigma^2$, and no other estimator can have smaller
limiting variance in any interval of $\mu$ values. However, there are many other
estimators for this problem which are also BAN estimators of $\mu$, that is,
estimators with the same normal distribution in the limit. For example,


$$T_n' = \frac{1}{n+1} \sum_{i=1}^{n} X_i, \qquad n = 1, 2, \ldots,$$


is a BAN estimator of $\mu$. BAN estimators are necessarily weakly consistent
by (ii) of the definition.

3.4 Loss and Risk Functions

In Subsec. 3.2 we used mean-squared error of an estimator as a measure of the
closeness of the estimator to $\tau(\theta)$. Other measures are possible, for example,
$\mathscr{E}_\theta[|T - \tau(\theta)|]$,
called the mean absolute deviation. In order to exhibit and consider still other
measures of closeness, we will borrow and rely on the language of decision
theory. On the basis of an observed random sample from some density
function, the statistician has to decide what to estimate $\tau(\theta)$ to be. One might
then call the value of some estimator $T = t(X_1, \ldots, X_n)$ a decision and call the
estimator itself a decision function since it tells us what decision to make. Now
the estimate $t$ of $\tau(\theta)$ might be in error; if so, some measure of the severity of
the error seems appropriate. The word “loss” is used in place of “error,”
and “loss function” is used as a measure of the “error.” A formal definition
follows.

Definition 11 Loss function Consider estimating $\tau(\theta)$. Let $t$ denote
an estimate of $\tau(\theta)$. The loss function, denoted by $\ell(t; \theta)$, is defined to
be a real-valued function satisfying (i) $\ell(t; \theta) \ge 0$ for all possible estimates
$t$ and all $\theta$ in $\bar{\Theta}$ and (ii) $\ell(t; \theta) = 0$ for $t = \tau(\theta)$. $\ell(t; \theta)$ equals the loss
incurred if one estimates $\tau(\theta)$ to be $t$ when $\theta$ is the true parameter value. ////

In a given estimation problem one would have to define an appropriate 
loss function for the particular problem under study. It is a measure of the 
error and presumably would be greater for large error than for small error. We 
would want the loss to be small; or, stated another way, we want the error in 
estimation to be small, or we want the estimate to be close to what it is estimating. 


EXAMPLE 16 Several possible loss functions are:

(i) $\ell_1(t; \theta) = [t - \tau(\theta)]^2$.

(ii) $\ell_2(t; \theta) = |t - \tau(\theta)|$.

(iii) $\ell_3(t; \theta) = A$ if $|t - \tau(\theta)| > \varepsilon$, and $\ell_3(t; \theta) = 0$ if $|t - \tau(\theta)| \le \varepsilon$, where $A > 0$.

(iv) $\ell_4(t; \theta) = \rho(\theta)\,|t - \tau(\theta)|^r$ for $\rho(\theta) \ge 0$ and $r > 0$.

$\ell_1$ is called the squared-error loss function, and $\ell_2$ is called the absolute-error
loss function. Note that both $\ell_1$ and $\ell_2$ increase as the error $t - \tau(\theta)$
increases in magnitude. $\ell_3$ says that you lose nothing if the estimate $t$
is within $\varepsilon$ units of $\tau(\theta)$ and otherwise you lose amount $A$. $\ell_4$ is a general
loss function that includes both $\ell_1$ and $\ell_2$ as special cases. ////

We assume now that an appropriate loss function has been defined for our
estimation problem, and we think of the loss function as a measure of error
or loss. Our object is to select an estimator $T = t(X_1, \ldots, X_n)$ that makes
this error or loss small. (Admittedly, we are not considering a very important,
substantive problem by assuming that a suitable loss function is given. In
general, selection of an appropriate loss function is not trivial.) The loss
function in its first argument depends on the estimate $t$, and $t$ is a value of the
estimator $T$; that is, $t = t(x_1, \ldots, x_n)$. Thus, our loss depends on the sample
$X_1, \ldots, X_n$. We cannot hope to make the loss small for every possible sample,
but we can try to make the loss small on the average. Hence, if we alter our
objective of picking that estimator that makes the loss small to picking that
estimator that makes the average loss small, we can remove the dependence of
the loss on the sample $X_1, \ldots, X_n$. This notion is embodied in the following
definition.

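Definition 12 Risk function For a given loss function $\ell(\cdot\,; \cdot)$, the risk
function, denoted by $\mathscr{R}_t(\theta)$, of an estimator $T = t(X_1, \ldots, X_n)$ is defined
to be

$$\mathscr{R}_t(\theta) = \mathscr{E}_\theta[\ell(T; \theta)]. \qquad (10)$$

////

The risk function is the average loss. The expectation in Eq. (10) can be
taken in two ways. For example, if the density $f(x; \theta)$ from which we sampled
is a probability density function, then

$$\mathscr{E}_\theta[\ell(T; \theta)] = \mathscr{E}_\theta[\ell(t(X_1, \ldots, X_n); \theta)]
= \int \cdots \int \ell(t(x_1, \ldots, x_n); \theta) \prod_{i=1}^{n} f(x_i; \theta)\, dx_i.$$

Or we can consider the random variable $T$ and the density of $T$. We get

$$\mathscr{E}_\theta[\ell(T; \theta)] = \int \ell(t; \theta)\, f_T(t)\, dt,$$

where $f_T(t)$ is the density of the estimator $T$. In either case, the expectation
averages out the values of $x_1, \ldots, x_n$.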


EXAMPLE 17 Consider the same loss functions given in Example 16. The
corresponding risks are given by

(i) $\mathscr{E}_\theta[[T - \tau(\theta)]^2]$, our familiar mean-squared error.
(ii) $\mathscr{E}_\theta[|T - \tau(\theta)|]$, the mean absolute error.
(iii) $A \cdot P_\theta[|T - \tau(\theta)| > \varepsilon]$.
(iv) $\rho(\theta)\,\mathscr{E}_\theta[|T - \tau(\theta)|^r]$. ////

Our object now is to select an estimator that makes the average loss (risk)
small and ideally select an estimator that has the smallest risk. To help meet
this objective, we use the concept of admissible estimators.
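Before turning to admissibility, the risks of Example 17 can be approximated by averaging the losses of Example 16 over simulated samples. The sketch below is an illustration only: it assumes sampling from a normal density with mean $\theta$ and variance 1, takes $\tau(\theta) = \theta$, uses the sample mean as the estimator, and fixes $\varepsilon = 0.5$ and $A = 1$. All function names are hypothetical.

```python
import random


def squared_error(t, tau):
    """Loss (i): squared error."""
    return (t - tau) ** 2


def absolute_error(t, tau):
    """Loss (ii): absolute error."""
    return abs(t - tau)


def zero_or_a(t, tau, eps=0.5, a=1.0):
    """Loss (iii): lose nothing within eps of tau, otherwise lose a."""
    return a if abs(t - tau) > eps else 0.0


def sample_mean(xs):
    return sum(xs) / len(xs)


def estimated_risk(loss, estimator, theta, n, reps, rng):
    """Average loss over `reps` simulated samples of size n from N(theta, 1)."""
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        total += loss(estimator(sample), theta)
    return total / reps


rng = random.Random(1)
for loss in (squared_error, absolute_error, zero_or_a):
    print(loss.__name__, estimated_risk(loss, sample_mean, 2.0, 10, 5000, rng))
```

Under squared-error loss the estimated risk should be close to $1/n = 0.1$, the mean-squared error of the sample mean in this assumed setting.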

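Definition 13 Admissible estimator For two estimators $T_1 = t_1(X_1, \ldots, X_n)$
and $T_2 = t_2(X_1, \ldots, X_n)$, estimator $t_1$ is defined to be a
better estimator than $t_2$ if and only if

$$\mathscr{R}_{t_1}(\theta) \le \mathscr{R}_{t_2}(\theta) \quad \text{for all } \theta \text{ in } \bar{\Theta}$$

and

$$\mathscr{R}_{t_1}(\theta) < \mathscr{R}_{t_2}(\theta) \quad \text{for at least one } \theta \text{ in } \bar{\Theta}.$$

An estimator $T = t(X_1, \ldots, X_n)$ is defined to be admissible if and only if
there is no better estimator. ////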

In general, given two estimators $t_1$ and $t_2$, neither is better than the other;
that is, their respective risk functions, as functions of $\theta$, cross. We observed
this same phenomenon when we studied the mean-squared error. Here, as
there, there will not, in general, exist an estimator with uniformly smallest risk.
The problem is the dependence of the risk function on $\theta$. What we might do
is average out $\theta$, just as we average out the dependence on $x_1, \ldots, x_n$ when
going from the loss function to the risk function. The question then is: Just
how should $\theta$ be averaged out? We will consider just this problem in Sec. 7
on the Bayes estimators. Another way of removing the dependence of the risk
function on $\theta$ is to replace the risk function by its maximum value and compare
estimators by looking at their respective maximum risks, naturally preferring
that estimator with smallest maximum risk. Such an estimator is said to be
minimax.

Definition 14 Minimax An estimator $t^*$ is defined to be a minimax
estimator if and only if $\sup_\theta \mathscr{R}_{t^*}(\theta) \le \sup_\theta \mathscr{R}_t(\theta)$ for every estimator $t$. ////

Minimax estimators will be discussed in Sec. 7.
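Comparing estimators by maximum risk can be tried out on a small case. The sketch below assumes Bernoulli sampling with $n = 25$ and squared-error loss, and computes the exact risk of three candidate estimators of $\theta$ on a grid of $\theta$ values. The candidates, including the shrinkage rule `shrunk`, are chosen by us for illustration; the comparison ranks only these three and is not a proof that any of them is minimax.

```python
from math import comb


def risk(estimator, theta, n):
    """Exact squared-error risk of estimator(s, n), where s is the number of successes."""
    return sum(
        comb(n, s) * theta ** s * (1 - theta) ** (n - s) * (estimator(s, n) - theta) ** 2
        for s in range(n + 1)
    )


def max_risk(estimator, n, grid_size=201):
    """Largest risk over an evenly spaced grid of theta values in [0, 1]."""
    return max(risk(estimator, k / (grid_size - 1), n) for k in range(grid_size))


def proportion(s, n):
    """Sample proportion."""
    return s / n


def always_half(s, n):
    """Ignores the data and always estimates theta to be 1/2."""
    return 0.5


def shrunk(s, n):
    """Sample proportion pulled toward 1/2 (illustrative choice)."""
    return (s + n ** 0.5 / 2) / (n + n ** 0.5)


n = 25
for estimator in (proportion, always_half, shrunk):
    print(f"{estimator.__name__:12s} maximum risk = {max_risk(estimator, n):.6f}")
```

The sample proportion has risk $\theta(1 - \theta)/n$, which is largest at $\theta = 1/2$, while the shrinkage rule has the constant risk $n/[4(n + \sqrt{n})^2]$; among these three candidates the shrinkage rule has the smallest maximum risk.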


4 SUFFICIENCY 

Prior to continuing our pursuit of finding best estimators, we introduce the
concept of sufficiency of statistics. In many of the estimation problems that
we will encounter, we will be able to summarize the information in the sample
$X_1, \ldots, X_n$. That is, we will be able to find some function of the sample that
tells us just as much about $\theta$ as the sample itself. Such a function would be
sufficient for estimation purposes and accordingly is called a sufficient statistic.
Sufficient statistics are of interest in themselves, as well as being useful in
statistical inference problems such as estimation or testing of hypotheses.
Because the concept of sufficiency is widely applicable, possibly the notion
should have been isolated in a chapter by itself rather than buried in this chapter
on estimation.

4.1 Sufficient Statistics

Let $X_1, \ldots, X_n$ be a random sample from some density, say $f(\cdot\,; \theta)$. We defined
a statistic to be a function of the sample; that is, a statistic is a function with
domain the range of values that $(X_1, \ldots, X_n)$ can take on and counterdomain
the real numbers. A statistic $T = t(X_1, \ldots, X_n)$ is also a random variable; it
condenses the $n$ random variables $X_1, X_2, \ldots, X_n$ into a single random variable.
Such condensing is appealing since we would rather work with unidimensional
quantities than $n$-dimensional quantities. We shall be interested in seeing if we
lost any “information” by this condensing process. The condensing can also be
viewed another way. Let $\mathfrak{X}$ denote the range of values that $(X_1, \ldots, X_n)$ can
assume. For example, if we sample from a Bernoulli distribution, then $\mathfrak{X}$ is a
collection of all $n$-dimensional vectors with components either 0 or 1, or if we
sample from a normal distribution, then $\mathfrak{X}$ is an $n$-dimensional euclidean space.
Now a statistic induces or defines a partition of $\mathfrak{X}$. (Recall that a partition of $\mathfrak{X}$
is a collection of mutually disjoint subsets of $\mathfrak{X}$ whose union is $\mathfrak{X}$.) Let
$t(\cdot, \ldots, \cdot)$ be the function corresponding to the statistic $T = t(X_1, \ldots, X_n)$.
The partition induced by $t(\cdot, \ldots, \cdot)$ is brought about as follows: Let $t_0$ denote
any value of the function $t(\cdot, \ldots, \cdot)$; that subset of $\mathfrak{X}$ consisting of all those
points $(x_1, \ldots, x_n)$ for which $t(x_1, \ldots, x_n) = t_0$ is one subset in the collection
of subsets which the partition comprises; the other subsets are similarly formed
by considering other values of $t(\cdot, \ldots, \cdot)$. For example, if a sample of size 3
is selected from a Bernoulli distribution, then $\mathfrak{X}$ consists of eight points
(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1). Let
$t(x_1, x_2, x_3) = x_1 + x_2 + x_3$; then $t(\cdot, \cdot, \cdot)$ takes on the values 0, 1, 2, and 3.
The partition of $\mathfrak{X}$ induced by $t(\cdot, \cdot, \cdot)$ consists of the four subsets {(0, 0, 0)},
{(0, 0, 1), (0, 1, 0), (1, 0, 0)}, {(0, 1, 1), (1, 0, 1), (1, 1, 0)}, and {(1, 1, 1)}
corresponding, respectively, to the four values 0, 1, 2, and 3 of $t(\cdot, \cdot, \cdot)$. A statistic
then is really a condensation of $\mathfrak{X}$. In the above example, if we use the statistic
$t(\cdot, \cdot, \cdot)$, we have only four different values to worry about instead of the eight
different points of $\mathfrak{X}$.

Several different statistics can induce the same partition. In fact, if
$t(\cdot, \ldots, \cdot)$ is a statistic, then any one-to-one function of $t$ has the same partition
as $t$. In the example above $t'(x_1, x_2, x_3) = 6(x_1 + x_2 + x_3)^2$ or even
$t''(x_1, x_2, x_3) = x_1^2 + x_2^2 + x_3^2$ induces the same partition as $t(x_1, x_2, x_3) =
x_1 + x_2 + x_3$. One of the reasons for using statistics is that they do condense
$\mathfrak{X}$, and if such is our only reason for using a statistic, then any two statistics with
the same partition are of the same utility. The important aspect of a statistic
is the partition of $\mathfrak{X}$ that it induces, not the values that it assumes.
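The partition induced by a statistic on the eight Bernoulli sample points can be listed mechanically. This is a small sketch with hypothetical names: `induced_partition` groups the points by the value of the statistic, `scaled` is a one-to-one function of the sum, and `other` is a different statistic included only for contrast.

```python
from itertools import product


def induced_partition(statistic, points):
    """Group sample points by the value of the statistic; return the set of blocks."""
    blocks = {}
    for x in points:
        blocks.setdefault(statistic(*x), []).append(x)
    return {frozenset(block) for block in blocks.values()}


def total(x1, x2, x3):
    return x1 + x2 + x3


def scaled(x1, x2, x3):
    """A one-to-one function of the total on its possible values 0, 1, 2, 3."""
    return 6 * (x1 + x2 + x3) ** 2


def other(x1, x2, x3):
    """A different statistic, for contrast."""
    return x1 * x2 + x3


points = list(product((0, 1), repeat=3))
for block in sorted(induced_partition(total, points), key=lambda b: sum(min(b))):
    print(sorted(block))
print("same partition as the total:", induced_partition(scaled, points) == induced_partition(total, points))
print("number of blocks for the contrasting statistic:", len(induced_partition(other, points)))
```

The values taken by `total` and `scaled` differ, but the blocks coincide, which is the sense in which the two statistics are of the same utility.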

A sufficient statistic is a particular kind of statistic. It is a statistic that
condenses $\mathfrak{X}$ in such a way that no “information about $\theta$” is lost. The only
information about the parameter $\theta$ in the density $f(\cdot\,; \theta)$ from which we sampled
is contained in the sample $X_1, \ldots, X_n$; so, when we say that a statistic loses no
information, we mean that it contains all the information about $\theta$ that is
contained in the sample. We emphasize that the type of information of which we
are speaking is that information about $\theta$ contained in the sample given that we
know the form of the density; that is, we know the function $f(\cdot\,; \cdot)$ in $f(\cdot\,; \theta)$,
and the parameter $\theta$ is the only unknown. We are not speaking of information
in the sample that might be useful in checking the validity of our assumption
that the density does indeed have form $f(\cdot\,; \cdot)$.

Now we shall formalize the definition of a sufficient statistic; in fact, we
shall give two definitions, namely, Definitions 15 and 16. It can be argued that
the two definitions are equivalent, but we will not do it.

Definition 15 Sufficient statistic Let $X_1, \ldots, X_n$ be a random sample
from the density $f(\cdot\,; \theta)$, where $\theta$ may be a vector. A statistic $S =
s(X_1, \ldots, X_n)$ is defined to be a sufficient statistic if and only if the
conditional distribution of $X_1, \ldots, X_n$ given $S = s$ does not depend on $\theta$ for
any value $s$ of $S$. ////

Note that we use $S = s(X_1, \ldots, X_n)$, instead of $T = t(X_1, \ldots, X_n)$, to
denote a sufficient statistic. Some care is required in interpreting the
conditional distribution of $X_1, \ldots, X_n$ given $S = s$, as Example 19 and the paragraph
preceding it demonstrate.

The definition says that a statistic $S = s(X_1, \ldots, X_n)$ is sufficient if the
conditional distribution of the sample given the value of the statistic does not
depend on $\theta$. The idea is that if you know the value of the sufficient statistic,
then the sample values themselves are not needed and can tell you nothing more
about $\theta$, and this is true since the distribution of the sample given the sufficient
statistic does not depend on $\theta$. One cannot hope to learn anything about $\theta$ by
sampling from a distribution that does not depend on $\theta$.

EXAMPLE 18 Let $X_1, X_2, X_3$ be a sample of size 3 from the Bernoulli
distribution. Consider the two statistics $S = s(X_1, X_2, X_3) = X_1
+ X_2 + X_3$ and $T = t(X_1, X_2, X_3) = X_1 X_2 + X_3$. We will show that
$s(\cdot, \cdot, \cdot)$ is sufficient and $t(\cdot, \cdot, \cdot)$ is not. The first column of Fig. 6 is $\mathfrak{X}$.

FIGURE 6

The conditional densities given in the last two columns are routinely
calculated. For instance,

$$f_{X_1, X_2, X_3 \mid S = 1}(0, 1, 0 \mid 1) = P[X_1 = 0; X_2 = 1; X_3 = 0 \mid S = 1]$$

$$= \frac{P[X_1 = 0; X_2 = 1; X_3 = 0; S = 1]}{P[S = 1]}
= \frac{(1 - p)\,p\,(1 - p)}{\binom{3}{1} p (1 - p)^2} = \frac{1}{3}$$

and

$$f_{X_1, X_2, X_3 \mid T = 0}(0, 1, 0 \mid 0) = \frac{P[X_1 = 0; X_2 = 1; X_3 = 0; T = 0]}{P[T = 0]}$$

$$= \frac{(1 - p)^2 p}{(1 - p)^3 + 2(1 - p)^2 p} = \frac{p}{1 - p + 2p} = \frac{p}{1 + p}.$$

The conditional distribution of the sample given the values of $S$ is
independent of $p$; so $S$ is a sufficient statistic; however, the conditional
distribution of the sample given the values of $T$ depends on $p$; so $T$ is not sufficient.
We might note that the statistic $T$ provides a greater condensation of $\mathfrak{X}$
than does $S$. A question that might be asked is: Is there a statistic which
provides greater condensation of $\mathfrak{X}$ than does $S$ which is sufficient as well?
The answer is “no” and can be verified by trying all possible partitions of
$\mathfrak{X}$ consisting of three or fewer subsets. ////
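The conditional probabilities of Example 18 can be recomputed exactly for particular values of $p$. The sketch below uses exact rational arithmetic; the two values $p = 1/4$ and $p = 3/4$ and the function name `conditional_given` are our choices for illustration.

```python
from fractions import Fraction
from itertools import product


def conditional_given(statistic, p):
    """Return {x: P[X = x | statistic(X) = statistic(x)]} for three Bernoulli(p) trials."""
    points = list(product((0, 1), repeat=3))
    joint = {x: p ** sum(x) * (1 - p) ** (3 - sum(x)) for x in points}
    marginal = {}
    for x, prob in joint.items():
        marginal[statistic(x)] = marginal.get(statistic(x), 0) + prob
    return {x: joint[x] / marginal[statistic(x)] for x in points}


def s(x):
    return x[0] + x[1] + x[2]


def t(x):
    return x[0] * x[1] + x[2]


for p in (Fraction(1, 4), Fraction(3, 4)):
    given_s = conditional_given(s, p)[(0, 1, 0)]
    given_t = conditional_given(t, p)[(0, 1, 0)]
    print(f"p={p}:  P[(0,1,0) | S=1] = {given_s},  P[(0,1,0) | T=0] = {given_t}")
```

The conditional probability given $S$ is 1/3 for both values of $p$, while the one given $T$ changes with $p$ in agreement with $p/(1 + p)$.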


In the case of sampling from a probability density function, the meaning
of the term “the conditional distribution of $X_1, \ldots, X_n$ given $S = s$” that
appears in Definition 15 may not be obvious since then $P[S = s] = 0$. We can
give two interpretations. The first deals with the joint cumulative distribution
function and uses Eq. (9) of Subsec. 3.3 in Chap. IV; that is, to show that
$S = s(X_1, \ldots, X_n)$ is sufficient, one shows that $P[X_1 \le x_1; \ldots; X_n \le x_n \mid S = s]$
is independent of $\theta$, where $P[X_1 \le x_1; \ldots; X_n \le x_n \mid S = s]$ is defined as in
Eq. (9) of Chap. IV. The second interpretation is obtained if a one-to-one
transformation of $X_1, X_2, \ldots, X_n$ to, say, $S, Y_2, \ldots, Y_n$ is made, and then it is
demonstrated that the density of $Y_2, \ldots, Y_n$ given $S = s$ is independent of $\theta$. If
the distribution of $Y_2, \ldots, Y_n$ given $S = s$ is independent of $\theta$, then the
distribution of $S, Y_2, \ldots, Y_n$ given $S = s$ is independent of $\theta$, and hence the distribution
of $X_1, X_2, \ldots, X_n$ given $S = s$ is independent of $\theta$. These two interpretations
are illustrated in the following example.


EXAMPLE 19 Let $X_1, \ldots, X_n$ be a random sample from $f(\cdot\,; \theta) = \phi_{\theta, 1}(\cdot)$;
that is, $X_1, \ldots, X_n$ is a random sample from a normal distribution with
mean $\theta$ and variance unity. In order to expedite calculations, we take
$n = 2$. Let us argue that $S = s(X_1, X_2) = X_1 + X_2$ is sufficient using
the second interpretation above. The transformation of $(X_1, X_2)$ to
$(S, Y_2)$, where $S = X_1 + X_2$ and $Y_2 = X_2 - X_1$, is one-to-one; so it
suffices to show that $f_{Y_2 \mid S}(y_2 \mid s)$ is independent of $\theta$. Now

$$f_{Y_2 \mid S}(y_2 \mid s) = \frac{f_{Y_2, S}(y_2, s)}{f_S(s)} = \frac{f_{Y_2}(y_2)\, f_S(s)}{f_S(s)} = f_{Y_2}(y_2)$$

(using the independence of $X_1 + X_2$ and $X_2 - X_1$ that was proved in
Theorem 8 of Chap. VI), but

$$f_{Y_2}(y_2) = \frac{1}{\sqrt{2\pi}\,\sqrt{2}}\, e^{-y_2^2/4}$$

since

$$Y_2 \sim N(0, 2),$$

which is independent of $\theta$.

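The necessary calculations for the first interpretation above are
less simple. We must show that $P[X_1 \le x_1; X_2 \le x_2 \mid S = s]$ is
independent of $\theta$. According to Eq. (9) of Chap. IV,

$$P[X_1 \le x_1; X_2 \le x_2 \mid S = s] = \lim_{h \to 0} P[X_1 \le x_1; X_2 \le x_2 \mid s - h < S < s + h].$$

We have two cases to consider: (i) $s \le x_1 + x_2$ and (ii) $x_1 + x_2 < s$.
$P[X_1 \le x_1; X_2 \le x_2 \mid S = s]$ is clearly 0 (and hence independent of $\theta$) for
case (ii), since $X_1 \le x_1$ and $X_2 \le x_2$ force $S = X_1 + X_2 \le x_1 + x_2$. Let us
consider (i).

$$P[X_1 \le x_1; X_2 \le x_2 \mid S = s] = \lim_{h \to 0} P[X_1 \le x_1; X_2 \le x_2 \mid s - h < S < s + h]$$

$$= \lim_{h \to 0} \frac{P[X_1 \le x_1; X_2 \le x_2; s - h < S < s + h]}{P[s - h < S < s + h]}$$

$$= \frac{\lim_{h \to 0} \frac{1}{2h} P[X_1 \le x_1; X_2 \le x_2; s - h < S < s + h]}{\lim_{h \to 0} \frac{1}{2h} P[s - h < S < s + h]}$$

$$= \frac{\lim_{h \to 0} \frac{1}{2h} P[X_1 \le x_1; X_2 \le x_2; s - h < S < s + h]}{f_S(s)}.$$

Note that (see Fig. 7), for small $h > 0$,

$$\int_{s - x_2 + h}^{x_1} \int_{s - h - u}^{s + h - u} f_{X_1}(u) f_{X_2}(v)\, dv\, du
\le P[X_1 \le x_1; X_2 \le x_2; s - h < X_1 + X_2 < s + h]
\le \int_{s - x_2 - h}^{x_1} \int_{s - h - u}^{s + h - u} f_{X_1}(u) f_{X_2}(v)\, dv\, du,$$

and hence

$$\lim_{h \to 0} \frac{1}{2h} P[X_1 \le x_1; X_2 \le x_2; s - h < X_1 + X_2 < s + h]
= \int_{s - x_2}^{x_1} f_{X_1}(u) f_{X_2}(s - u)\, du.$$

Finally, then,

$$P[X_1 \le x_1; X_2 \le x_2 \mid S = s]
= \frac{\int_{s - x_2}^{x_1} f_{X_1}(u) f_{X_2}(s - u)\, du}{f_S(s)}
= \frac{\int_{s - x_2}^{x_1} \frac{1}{2\pi}\, e^{-\frac{1}{2}(u - \theta)^2} e^{-\frac{1}{2}(s - u - \theta)^2}\, du}{\frac{1}{\sqrt{4\pi}}\, e^{-\frac{1}{4}(s - 2\theta)^2}}
= \frac{1}{\sqrt{\pi}} \int_{s - x_2}^{x_1} e^{-(u - s/2)^2}\, du,$$

which is independent of $\theta$. ////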





Definition 15 of a sufficient statistic is not very workable. First, it does
not tell us which statistic is likely to be sufficient, and, second, it requires us to
derive a conditional distribution which may not be easy, especially for
continuous random variables. In Subsec. 4.2 below, we will present a criterion
that may aid us in finding sufficient statistics.

Although we will not so argue, the following definition is equivalent to
Definition 15.


Definition 16 Sufficient statistic Let $X_1, \ldots, X_n$ be a random sample
from the density $f(\cdot\,; \theta)$. A statistic $S = s(X_1, \ldots, X_n)$ is defined to be a
sufficient statistic if and only if the conditional distribution of $T$ given $S$
does not depend on $\theta$ for any statistic $T = t(X_1, \ldots, X_n)$. ////


Definition 16 is particularly useful in showing that a particular statistic
is not sufficient. For instance, to prove that a statistic $T' = t'(X_1, \ldots, X_n)$ is
not sufficient, one needs only to find another statistic $T = t(X_1, \ldots, X_n)$ for
which the conditional distribution of $T$ given $T'$ depends on $\theta$.

For some problems, no single sufficient statistic exists. However, there
will always exist jointly sufficient statistics.


Definition 17 Jointly sufficient statistics Let $X_1, \ldots, X_n$ be a random
sample from the density $f(\cdot\,; \theta)$. The statistics $S_1, \ldots, S_r$ are defined to be
jointly sufficient if and only if the conditional distribution of $X_1, \ldots, X_n$
given $S_1 = s_1, \ldots, S_r = s_r$ does not depend on $\theta$. ////


The sample $X_1, \ldots, X_n$ itself is always jointly sufficient since the
conditional distribution of the sample given the sample does not depend on $\theta$. Also,
the order statistics $Y_1, \ldots, Y_n$ are jointly sufficient for random sampling. If the
order statistics are given, say, by $(Y_1 = y_1, \ldots, Y_n = y_n)$, then the only values
that can be taken on by $(X_1, \ldots, X_n)$ are the permutations of $y_1, \ldots, y_n$. Since
the sampling is random, each of the $n!$ permutations is equally likely. So, given
the values of the order statistics, the probability that the sample equals a
particular permutation of these given values of the order statistics is $1/n!$, which is
independent of $\theta$. (Sufficiency of the order statistics also follows from Theorem
5 below.)

If we recall that the important aspect of a statistic or set of statistics is the
partition of $\mathfrak{X}$ that it induces, and not the values that it takes on, then the
validity of the following theorem is evident.

Theorem 3 If $S_1 = s_1(X_1, \ldots, X_n), \ldots, S_r = s_r(X_1, \ldots, X_n)$ is a set of
jointly sufficient statistics, then any set of one-to-one functions, or
transformations, of $S_1, \ldots, S_r$ is also jointly sufficient. ////

For example, if $\sum X_i$ and $\sum X_i^2$ are jointly sufficient, then $\bar{X}$ and
$\sum (X_i - \bar{X})^2 = \sum X_i^2 - n\bar{X}^2$ are also jointly sufficient. Note, however, that
$\bar{X}^2$ and $\sum (X_i - \bar{X})^2$ may not be jointly sufficient since they are not one-to-one
functions of $\sum X_i$ and $\sum X_i^2$.

We note again that the parameter $\theta$ that appears in any of the above three
definitions of sufficient statistics can be a vector.


4.2 Factorization Criterion

The concept of sufficiency of statistics was defined in Definitions 15 to 17 above.
In many cases, a relatively easy criterion for examining a statistic or set of
statistics for sufficiency has been developed. This is given in the next two
theorems, the proofs of which are omitted.

Theorem 4 Factorization theorem (single sufficient statistic) Let $X_1,
X_2, \ldots, X_n$ be a random sample of size $n$ from the density $f(\cdot\,; \theta)$, where
the parameter $\theta$ may be a vector. A statistic $S = s(X_1, \ldots, X_n)$ is sufficient
if and only if the joint density of $X_1, \ldots, X_n$, which is $\prod_{i=1}^{n} f(x_i; \theta)$, factors as

$$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \theta) = g(s(x_1, \ldots, x_n); \theta)\, h(x_1, \ldots, x_n)
= g(s; \theta)\, h(x_1, \ldots, x_n), \qquad (11)$$

where the function $h(x_1, \ldots, x_n)$ is nonnegative and does not involve the
parameter $\theta$ and the function $g(s(x_1, \ldots, x_n); \theta)$ is nonnegative and
depends on $x_1, \ldots, x_n$ only through the function $s(\cdot, \ldots, \cdot)$. ////

Theorem 5 Factorization theorem (jointly sufficient statistics) Let $X_1,
X_2, \ldots, X_n$ be a random sample of size $n$ from the density $f(\cdot\,; \theta)$, where
the parameter $\theta$ may be a vector. A set of statistics $S_1 = s_1(X_1, \ldots, X_n), \ldots,
S_r = s_r(X_1, \ldots, X_n)$ is jointly sufficient if and only if the joint density
of $X_1, \ldots, X_n$ can be factored as

$$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \theta)
= g(s_1(x_1, \ldots, x_n), \ldots, s_r(x_1, \ldots, x_n); \theta)\, h(x_1, \ldots, x_n)
= g(s_1, \ldots, s_r; \theta)\, h(x_1, \ldots, x_n), \qquad (12)$$

where the function $h(x_1, \ldots, x_n)$ is nonnegative and does not involve the
parameter $\theta$ and the function $g(s_1, \ldots, s_r; \theta)$ is nonnegative and depends
on $x_1, \ldots, x_n$ only through the functions $s_1(\cdot, \ldots, \cdot), \ldots, s_r(\cdot, \ldots, \cdot)$. ////

Note that, according to Theorem 3, there are many possible sets of
sufficient statistics. The above two theorems give us a relatively easy method for
judging whether a certain statistic is sufficient or a set of statistics is jointly
sufficient. However, the method is not the complete answer since a particular
statistic may be sufficient yet the user may not be clever enough to factor the
joint density as in Eq. (11) or (12). The theorems may also be useful in
discovering sufficient statistics.

Actually, the result of either of the above factorization theorems is
intuitively evident if one notes the following: If the joint density factors as
indicated in, say, Eq. (12), then the likelihood function is proportional to
$g(s_1, \ldots, s_r; \theta)$, which depends on the observations $x_1, \ldots, x_n$ only through
$s_1, \ldots, s_r$ [the likelihood function is viewed as a function of $\theta$, so $h(x_1, \ldots, x_n)$
is just a proportionality constant], which means that the information about
$\theta$ that the likelihood function contains is embodied in the statistics
$s_1(\cdot, \ldots, \cdot), \ldots, s_r(\cdot, \ldots, \cdot)$.

Before giving several examples, we remark that the function $h(\cdot, \ldots, \cdot)$
appearing in either Eq. (11) or (12) may be constant.


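EXAMPLE 20 Let $X_1, \ldots, X_n$ be a random sample from the Bernoulli density
with parameter $\theta$; that is,

$$f(x; \theta) = \theta^x (1 - \theta)^{1 - x} I_{\{0, 1\}}(x) \quad \text{and} \quad 0 \le \theta \le 1.$$

Then

$$\prod_{i=1}^{n} f(x_i; \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} I_{\{0, 1\}}(x_i)
= \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i} \prod_{i=1}^{n} I_{\{0, 1\}}(x_i).$$

If we take $\theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}$ as $g(s(x_1, \ldots, x_n); \theta)$ and $\prod I_{\{0, 1\}}(x_i)$ as
$h(x_1, \ldots, x_n)$ and set $s(x_1, \ldots, x_n) = \sum x_i$, then the joint density of
$X_1, \ldots, X_n$ factors as in Eq. (11), indicating that $S = s(X_1, \ldots, X_n) = \sum X_i$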
is a sufficient statistic. ////

EXAMPLE 21 Let $X_1, \ldots, X_n$ be a random sample from the normal density
with mean $\mu$ and variance unity. Here the parameter is denoted by $\mu$
instead of $\theta$. The joint density is given by

$$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \mu) = \prod_{i=1}^{n} \phi_{\mu, 1}(x_i)
= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\Big[-\frac{1}{2}(x_i - \mu)^2\Big]$$

$$= \frac{1}{(2\pi)^{n/2}} \exp\Big[-\frac{1}{2} \sum (x_i - \mu)^2\Big]
= \frac{1}{(2\pi)^{n/2}} \exp\Big[-\frac{1}{2}\Big(\sum x_i^2 - 2\mu \sum x_i + n\mu^2\Big)\Big].$$

If we take $h(x_1, \ldots, x_n) = [1/(2\pi)^{n/2}] \exp(-\frac{1}{2} \sum x_i^2)$ and $g(s(x_1, \ldots, x_n); \mu) =
\exp[\mu \sum x_i - (n/2)\mu^2]$, then the joint density has been factored as in
Eq. (11) with $s(x_1, \ldots, x_n) = \sum x_i$; hence $\sum X_i$ is a sufficient statistic.
(Recall that $\bar{X}_n$ is also sufficient since any one-to-one function of a
sufficient statistic is also sufficient.) ////
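The factorization of Example 21 can be checked numerically. In the sketch below the two samples are made-up numbers chosen to have the same sum, and the function names are hypothetical. The joint density equals $g \cdot h$, and the ratio of the joint densities of the two samples does not change with $\mu$, since only $h$ differs between them.

```python
from math import exp, pi, sqrt


def joint_density(xs, mu):
    """Product of N(mu, 1) densities evaluated at the sample xs."""
    prod = 1.0
    for x in xs:
        prod *= exp(-0.5 * (x - mu) ** 2) / sqrt(2 * pi)
    return prod


def h(xs):
    """Factor free of mu: (2 pi)^(-n/2) * exp(-sum x_i^2 / 2)."""
    n = len(xs)
    return exp(-0.5 * sum(x * x for x in xs)) / (2 * pi) ** (n / 2)


def g(s, n, mu):
    """Factor depending on the sample only through s = sum x_i."""
    return exp(mu * s - 0.5 * n * mu ** 2)


xs = [0.3, -1.2, 2.0, 0.7]
ys = [1.8, 0.0, -0.5, 0.5]  # a different sample with the same sum, 1.8

for mu in (-1.0, 0.4, 2.5):
    lhs = joint_density(xs, mu)
    rhs = g(sum(xs), len(xs), mu) * h(xs)
    ratio = joint_density(xs, mu) / joint_density(ys, mu)
    print(f"mu={mu:5.1f}  joint={lhs:.6e}  g*h={rhs:.6e}  ratio xs/ys={ratio:.6f}")
```

The constant ratio is the numerical counterpart of the remark that $h$ acts only as a proportionality constant in the likelihood function.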


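EXAMPLE 22 Let $X_1, \ldots, X_n$ be a random sample from the normal density
$\phi_{\mu, \sigma^2}(\cdot)$. Here the parameter $\theta$ is a vector of two components; that is,
$\theta = (\mu, \sigma)$. The joint density of $X_1, \ldots, X_n$ is given by

$$\prod_{i=1}^{n} \phi_{\mu, \sigma^2}(x_i) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big[-\frac{1}{2}\Big(\frac{x_i - \mu}{\sigma}\Big)^2\Big]
= \frac{1}{(2\pi)^{n/2}}\, \sigma^{-n} \exp\Big[-\frac{1}{2\sigma^2} \sum (x_i - \mu)^2\Big]$$

$$= \frac{1}{(2\pi)^{n/2}}\, \sigma^{-n} \exp\Big[-\frac{1}{2\sigma^2}\Big(\sum x_i^2 - 2\mu \sum x_i + n\mu^2\Big)\Big];$$

so the joint density itself depends on the observations $x_1, \ldots, x_n$ only
through the statistics $s_1(x_1, \ldots, x_n) = \sum x_i$ and $s_2(x_1, \ldots, x_n) = \sum x_i^2$;
that is, the joint density is factored as in Eq. (12) with $h(x_1, \ldots, x_n) = 1$.
Hence, $\sum X_i$ and $\sum X_i^2$ are jointly sufficient. It can be shown that
$\bar{X}_n$ and $S^2 = [1/(n - 1)] \sum (X_i - \bar{X})^2$ are one-to-one functions of $\sum X_i$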
and X X, 2 ; so and S 2 are also jointly sufficient. //// 
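The one-to-one correspondence between the pair (sum of the observations, sum of their squares) and the pair (sample mean, sample variance) can be made concrete. The following is a minimal Python sketch, not part of the original text; the function names and the data are hypothetical.

```python
from statistics import mean, variance

def mean_var_from_sums(n, sum_x, sum_x2):
    """Map (sum x_i, sum x_i^2) to (sample mean, S^2 with divisor n - 1)."""
    xbar = sum_x / n
    s2 = (sum_x2 - n * xbar**2) / (n - 1)
    return xbar, s2

def sums_from_mean_var(n, xbar, s2):
    """Inverse map: recover (sum x_i, sum x_i^2) from (sample mean, S^2)."""
    return n * xbar, (n - 1) * s2 + n * xbar**2

data = [2.0, 3.5, 1.0, 4.5, 3.0]  # illustrative observations
n = len(data)
sum_x, sum_x2 = sum(data), sum(x * x for x in data)

xbar, s2 = mean_var_from_sums(n, sum_x, sum_x2)
back = sums_from_mean_var(n, xbar, s2)  # should reproduce (sum_x, sum_x2)
```

Because each pair can be computed from the other, either pair carries the same information about the parameters, which is why both are jointly sufficient.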
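EXAMPLE 23 Let $X_1, \ldots, X_n$ be a random sample from a uniform
distribution over the interval $[\theta_1, \theta_2]$. The joint density of
$X_1, \ldots, X_n$ is given by

$$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \theta_1, \theta_2) = \prod_{i=1}^n \frac{1}{\theta_2 - \theta_1} I_{[\theta_1, \theta_2]}(x_i)$$

$$= \frac{1}{(\theta_2 - \theta_1)^n}\, I_{[\theta_1, y_n]}(y_1)\, I_{[y_1, \theta_2]}(y_n),$$

where

$$y_1 = \min[x_1, \ldots, x_n] \quad \text{and} \quad y_n = \max[x_1, \ldots, x_n].$$

The joint density itself depends on $x_1, \ldots, x_n$ only through $y_1$ and
$y_n$; hence it factors as in Eq. (12) with $h(x_1, \ldots, x_n) = 1$. The
statistics $Y_1$ and $Y_n$ are jointly sufficient. Note that if we take
$\theta_1 = \theta$ and $\theta_2 = \theta + 1$, then $Y_1$ and $Y_n$ are
still jointly sufficient. However, if we take $\theta_1 = 0$ and
$\theta_2 = \theta$, then our factorization can be expressed as

$$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \theta) = \frac{1}{\theta^n}\, I_{[0, \theta]}(y_n)\, I_{[0, y_n]}(y_1).$$

Taking $g(s(x_1, \ldots, x_n); \theta) = (1/\theta^n) I_{[0, \theta]}(y_n)$ and
$h(x_1, \ldots, x_n) = I_{[0, y_n]}(y_1)$, we see that $Y_n$ alone is
sufficient. ////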
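The last claim of Example 23 can be illustrated numerically. The following is a minimal Python sketch, not part of the original text; the function names and samples are hypothetical. For the uniform density on the interval from 0 to theta, two nonnegative samples of the same size with the same maximum have identical likelihood functions.

```python
def uniform_likelihood(xs, theta):
    """Joint density of a sample from the uniform density on [0, theta]."""
    if any(x < 0 or x > theta for x in xs):
        return 0.0
    return theta ** -len(xs)

def g(y_n, n, theta):
    """g(y_n; theta) = theta^(-n) * I_[0, theta](y_n), a function of the maximum only."""
    return theta ** -n if 0 <= y_n <= theta else 0.0

# Two illustrative samples of size 4 sharing the same maximum, 2.4.
sample_a = [0.2, 1.7, 0.9, 2.4]
sample_b = [2.4, 2.4, 0.1, 1.0]
thetas = [2.0, 2.4, 3.0, 5.0]

likelihoods_a = [uniform_likelihood(sample_a, th) for th in thetas]
likelihoods_b = [uniform_likelihood(sample_b, th) for th in thetas]
via_maximum = [g(2.4, 4, th) for th in thetas]
```

For these nonnegative samples the factor h equals 1, so the likelihood is exactly g evaluated at the sample maximum.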

The factorization criterion of Eqs. (11) and (12) is primarily useful in
showing that a statistic or set of statistics is sufficient. It is not useful in
proving that a statistic or set of statistics is not sufficient. The fact that we
cannot factor the joint density does not mean that it cannot be factored; it could
be that we are just not able to find a correct factorization.

If we go back and look through our examples on maximum-likelihood
estimators (see Examples 5 to 8), we will see that all the maximum-likelihood
estimators that appear there depend on the sample $X_1, \ldots, X_n$ through
sufficient statistics. This is not something that is characteristic of the
relatively simple examples we had given but something that is true in general.

Theorem 6 A maximum-likelihood estimator or set of maximum-likelihood
estimators depends on the sample through any set of jointly
sufficient statistics.

PROOF If $S_1 = s_1(X_1, \ldots, X_n), \ldots, S_k = s_k(X_1, \ldots, X_n)$ are
jointly sufficient, then the likelihood function can be written as

$$L(\theta; x_1, \ldots, x_n) = \prod_{i=1}^n f(x_i; \theta) = g(s_1(x_1, \ldots, x_n), \ldots, s_k(x_1, \ldots, x_n); \theta)\, h(x_1, \ldots, x_n).$$

As a function of $\theta$, $L(\theta; x_1, \ldots, x_n)$ will have its maximum
at the same place that $g(s_1, \ldots, s_k; \theta)$ has its maximum, but the
place where $g$ attains its maximum can depend on $x_1, \ldots, x_n$ only
through $s_1, \ldots, s_k$ since $g$ does. ////


We might note that method-of-moments estimators may not be functions
of sufficient statistics. See Examples 4 and 23.


4.3 Minimal Sufficient Statistics 

When we introduced the concept of sufficiency, we said that our objective was to 
condense the data without losing any information about the parameter. We 
have seen that there is more than one set of sufficient statistics. For example, 
in sampling from a normal distribution with both the mean and variance
unknown, we have noted three sets of jointly sufficient statistics, namely, the
sample $X_1, \ldots, X_n$ itself, the order statistics $Y_1, \ldots, Y_n$, and
$\bar{X}$ and $S^2$. We naturally prefer the jointly sufficient set $\bar{X}$
and $S^2$ since they condense the data more than either of the other two.
(Note that the order statistics do condense the data.)
The question that we might ask is: Does there exist a set of sufficient statistics
that condenses the data more than $\bar{X}$ and $S^2$? The answer is that there
does not, but we will not develop the necessary tools to establish this answer. The
notion that we are alluding to is that of a minimum set of sufficient statistics, 
which we label minimal sufficient statistics . 

We noted earlier that corresponding to any statistic is the partition of
$\mathfrak{X}$ that it induces. The same is true of a set of statistics; a set
of statistics induces a partition of $\mathfrak{X}$. Loosely speaking, the
condensation of the data that a statistic or set of statistics exhibits can be
measured by the number of subsets in the partition induced by that statistic or
set of statistics. If a set of statistics has fewer subsets in its induced
partition than does the induced partition of another set of statistics, then we
say that the first set condenses the data more than the latter. Still loosely
speaking, a minimal sufficient set of statistics is then a sufficient set of
statistics that has fewer subsets in its partition than the induced
partition of any other set of sufficient statistics. So a set of sufficient
statistics is minimal if no other set of sufficient statistics condenses the
data more. A formal definition is the following.

Definition 18 Minimal sufficient statistic A set of jointly sufficient
statistics is defined to be minimal sufficient if and only if it is a function of
every other set of sufficient statistics. ////

Like many definitions, Definition 18 is of little use in finding minimal
sufficient statistics. A technique for finding minimal sufficient statistics has
been devised by Lehmann and Scheffé [19], but we will not present it. If the
joint density is properly factored, the factorization criterion will give us
minimal sufficient statistics. All the sets of sufficient statistics found in
Examples 20 to 23 are minimal.

4.4 Exponential Family

Many of the parametric families of densities that we have considered are
members of what is called the exponential class, or exponential family, not to be
confused with the negative exponential family of densities, which is a special case.

Definition 19 Exponential family of densities A one-parameter family
($\theta$ is unidimensional) of densities $f(\cdot; \theta)$ that can be
expressed as

$$f(x; \theta) = a(\theta)\, b(x) \exp[c(\theta)\, d(x)] \qquad (13)$$

for $-\infty < x < \infty$, for all $\theta \in \Theta$, and for a suitable
choice of functions $a(\cdot)$, $b(\cdot)$, $c(\cdot)$, and $d(\cdot)$ is
defined to belong to the exponential family or exponential class. ////


EXAMPLE 24 If $f(x; \theta) = \theta e^{-\theta x} I_{(0,\infty)}(x)$, then
$f(x; \theta)$ belongs to the exponential family for $a(\theta) = \theta$,
$b(x) = I_{(0,\infty)}(x)$, $c(\theta) = -\theta$, and $d(x) = x$ in
Eq. (13). ////

EXAMPLE 25 If $f(x; \theta) = f(x; \lambda)$ is the Poisson density, then

$$f(x; \lambda) = \frac{e^{-\lambda} \lambda^x}{x!}\, I_{\{0,1,\ldots\}}(x) = e^{-\lambda} \left[\frac{1}{x!}\, I_{\{0,1,\ldots\}}(x)\right] \exp(x \log \lambda).$$

In Eq. (13), we can take $a(\lambda) = e^{-\lambda}$,
$b(x) = (1/x!)\, I_{\{0,1,\ldots\}}(x)$, $c(\lambda) = \log \lambda$,
and $d(x) = x$; so $f(x; \lambda)$ belongs to the exponential family. ////


Remark If $f(x; \theta) = a(\theta)\, b(x) \exp[c(\theta)\, d(x)]$, then

$$\prod_{i=1}^n f(x_i; \theta) = a^n(\theta) \left[\prod_{i=1}^n b(x_i)\right] \exp\left[c(\theta) \sum_{i=1}^n d(x_i)\right],$$

and hence by the factorization criterion $\sum d(X_i)$ is a sufficient
statistic. ////
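The identity in the remark above can be verified for a specific member of the family. The following is a minimal Python sketch, not part of the original text, using the Poisson choices of a, b, c, and d from Example 25; the function names and the sample are hypothetical.

```python
from math import exp, factorial, log, prod

def poisson_pmf(x, lam):
    """Poisson density e^(-lam) lam^x / x! for x = 0, 1, 2, ..."""
    return exp(-lam) * lam**x / factorial(x)

def joint_direct(xs, lam):
    """Joint density as a plain product of the individual densities."""
    return prod(poisson_pmf(x, lam) for x in xs)

def joint_exponential_form(xs, lam):
    """Joint density written as a^n(lam) * prod b(x_i) * exp[c(lam) * sum d(x_i)]."""
    n = len(xs)
    a = exp(-lam)                            # a(lam)
    b = prod(1 / factorial(x) for x in xs)   # product of b(x_i)
    c = log(lam)                             # c(lam)
    d_sum = sum(xs)                          # sum of d(x_i), with d(x) = x
    return a**n * b * exp(c * d_sum)

sample = [0, 3, 1, 4]  # illustrative observations
direct = joint_direct(sample, 2.5)
factored = joint_exponential_form(sample, 2.5)
```

The data enter the parameter-dependent part only through the sum of the d values, which is the sufficient statistic named in the remark.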


The above remark shows that, under random sampling, if a density belongs 
to the one-parameter exponential family, then there is a sufficient statistic. In 
fact, it can be shown that the sufficient statistic so obtained is minimal. 

The one-parameter exponential family can be generalized to the
$k$-parameter exponential family.


Definition 20 $k$-parameter exponential family A family of densities
$f(\cdot; \theta_1, \ldots, \theta_k)$ that can be expressed as

$$f(x; \theta_1, \ldots, \theta_k) = a(\theta_1, \ldots, \theta_k)\, b(x) \exp \sum_{j=1}^k c_j(\theta_1, \ldots, \theta_k)\, d_j(x) \qquad (14)$$

for a suitable choice of functions $a(\cdot, \ldots, \cdot)$, $b(\cdot)$,
$c_j(\cdot, \ldots, \cdot)$, and $d_j(\cdot)$, $j = 1, \ldots, k$, is defined
to belong to the exponential family. ////

In Definition 20, note that the number of terms in the sum of the exponent
is $k$, which is also the dimension of the parameter.


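EXAMPLE 26 If $f(x; \theta_1, \theta_2) = \phi_{\mu,\sigma^2}(x)$, where
$(\theta_1, \theta_2) = (\mu, \sigma)$, then $f(x; \theta_1, \theta_2)$
belongs to the exponential family.

$$f(x; \theta_1, \theta_2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$$

$$= \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2} \cdot \frac{\mu^2}{\sigma^2}\right) \exp\left(-\frac{1}{2\sigma^2} x^2 + \frac{\mu}{\sigma^2} x\right).$$

Take $a(\mu, \sigma) = (1/\sqrt{2\pi}\,\sigma) \exp(-\tfrac{1}{2} \cdot \mu^2/\sigma^2)$,
$b(x) = 1$, $c_1(\mu, \sigma) = -1/2\sigma^2$, $c_2(\mu, \sigma) = \mu/\sigma^2$,
$d_1(x) = x^2$, and $d_2(x) = x$ to show that $\phi_{\mu,\sigma^2}(x)$ can be
expressed as in Eq. (14). ////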
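EXAMPLE 27 If

$$f(x; \theta_1, \theta_2) = \frac{1}{B(\theta_1, \theta_2)}\, x^{\theta_1 - 1} (1 - x)^{\theta_2 - 1} I_{(0,1)}(x),$$

then

$$f(x; \theta_1, \theta_2) = \frac{1}{B(\theta_1, \theta_2)}\, I_{(0,1)}(x) \exp[(\theta_1 - 1) \log x + (\theta_2 - 1) \log(1 - x)];$$

so $f(x; \theta_1, \theta_2)$ belongs to the exponential family with
$a(\theta_1, \theta_2) = 1/B(\theta_1, \theta_2)$, $b(x) = I_{(0,1)}(x)$,
$c_1(\theta_1, \theta_2) = \theta_1 - 1$, $c_2(\theta_1, \theta_2) = \theta_2 - 1$,
$d_1(x) = \log x$, and $d_2(x) = \log(1 - x)$. ////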

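Remark If
$f(x; \theta_1, \ldots, \theta_k) = a(\theta_1, \ldots, \theta_k)\, b(x) \exp \sum_{j=1}^k c_j(\theta_1, \ldots, \theta_k)\, d_j(x)$,
then, under random sampling,

$$\prod_{i=1}^n f(x_i; \theta_1, \ldots, \theta_k) = a^n(\theta_1, \ldots, \theta_k) \left[\prod_{i=1}^n b(x_i)\right] \exp\left[\sum_{j=1}^k c_j(\theta_1, \ldots, \theta_k) \sum_{i=1}^n d_j(x_i)\right],$$

and so by the factorization criterion
$\sum_{i=1}^n d_1(X_i), \ldots, \sum_{i=1}^n d_k(X_i)$ is a set of jointly
sufficient statistics.
$\sum_{i=1}^n d_1(X_i), \ldots, \sum_{i=1}^n d_k(X_i)$ are in fact minimal
sufficient statistics. ////


EXAMPLE 28 From Example 27, we see that $\sum \log X_i$ and
$\sum \log(1 - X_i)$ are jointly minimal sufficient when sampling from a beta
density. ////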


Our main use of the exponential family will not be in finding sufficient
statistics, but it will be in showing that the sufficient statistics are complete, a
concept that is useful in obtaining "best" estimators. This concept will be
defined in Sec. 5.

Lest one get the impression that all parametric families belong to the
exponential family, we remark that a family of uniform densities does not
belong to the exponential family. In fact, any family of densities for which the
range of the values where the density is positive depends on the parameter
$\theta$ does not belong to the exponential class.
5 UNBIASED ESTIMATION

Since estimators with uniformly minimum mean-squared error rarely exist, a
reasonable procedure is to restrict the class of estimating functions and look
for estimators with uniformly minimum mean-squared error within the restricted
class. One way of restricting the class of estimating functions would be to
consider only unbiased estimators and then among the class of unbiased
estimators search for an estimator with minimum mean-squared error.
Consideration of unbiased estimators and the problem of finding one with uniformly
minimum mean-squared error are to be the subjects of this section.

According to Eq. (9) the mean-squared error of an estimator $T$ of
$\tau(\theta)$ can be written as

$$\mathcal{E}_\theta[[T - \tau(\theta)]^2] = \operatorname{var}_\theta[T] + \{\tau(\theta) - \mathcal{E}_\theta[T]\}^2,$$

and if $T$ is an unbiased estimator of $\tau(\theta)$, then
$\mathcal{E}_\theta[T] = \tau(\theta)$, and so
$\mathcal{E}_\theta[[T - \tau(\theta)]^2] = \operatorname{var}_\theta[T]$.
Hence, seeking an estimator with uniformly minimum mean-squared error among
unbiased estimators is tantamount to seeking an estimator with uniformly
minimum variance among unbiased estimators.

Definition 21 Uniformly minimum-variance unbiased estimator (UMVUE)
Let $X_1, \ldots, X_n$ be a random sample from $f(\cdot; \theta)$. An estimator
$T^* = t^*(X_1, \ldots, X_n)$ of $\tau(\theta)$ is defined to be a uniformly
minimum-variance unbiased estimator of $\tau(\theta)$ if and only if
(i) $\mathcal{E}_\theta[T^*] = \tau(\theta)$, that is, $T^*$ is unbiased, and
(ii) $\operatorname{var}_\theta[T^*] \le \operatorname{var}_\theta[T]$ for any
other estimator $T = t(X_1, \ldots, X_n)$ of $\tau(\theta)$ which satisfies
$\mathcal{E}_\theta[T] = \tau(\theta)$. ////

In Subsec. 5.1 below we will derive a lower bound for the variance of 
unbiased estimators and show how it can sometimes be useful in finding an 
UMVUE. In Subsec. 5.2 we will introduce the concept of completeness and 
show how it in conjunction with sufficiency can sometimes be used to find an 
UMVUE. 


5.1 Lower Bound for Variance 


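Let $X_1, \ldots, X_n$ be a random sample from $f(\cdot; \theta)$, where
$\theta$ belongs to $\Theta$. Assume that $\Theta$ is a subset of the real
line. Let $T = t(X_1, \ldots, X_n)$ be an unbiased estimator of $\tau(\theta)$.
We will consider the case where $f(\cdot; \theta)$ is a probability
density function; the development for discrete density functions is analogous.
We make the following assumptions, called regularity conditions:

(i) $\dfrac{\partial}{\partial\theta} \log f(x; \theta)$ exists for all $x$ and all $\theta$.

(ii) $\dfrac{\partial}{\partial\theta} \displaystyle\int \cdots \int \prod_{i=1}^n f(x_i; \theta)\, dx_1 \cdots dx_n = \int \cdots \int \dfrac{\partial}{\partial\theta} \prod_{i=1}^n f(x_i; \theta)\, dx_1 \cdots dx_n$.

(iii) $\dfrac{\partial}{\partial\theta} \displaystyle\int \cdots \int t(x_1, \ldots, x_n) \prod_{i=1}^n f(x_i; \theta)\, dx_1 \cdots dx_n = \int \cdots \int t(x_1, \ldots, x_n) \dfrac{\partial}{\partial\theta} \prod_{i=1}^n f(x_i; \theta)\, dx_1 \cdots dx_n$.

(iv) $0 < \mathcal{E}_\theta\left[\left[\dfrac{\partial}{\partial\theta} \log f(X; \theta)\right]^2\right] < \infty$ for all $\theta$ in $\Theta$.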


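Theorem 7 Cramér-Rao inequality Under assumptions (i) to (iv) above

$$\operatorname{var}_\theta[T] \ge \frac{[\tau'(\theta)]^2}{n\, \mathcal{E}_\theta\left[\left[\dfrac{\partial}{\partial\theta} \log f(X; \theta)\right]^2\right]}, \qquad (15)$$

where $T = t(X_1, \ldots, X_n)$ is an unbiased estimator of $\tau(\theta)$.
Equality prevails in Eq. (15) if and only if there exists a function, say
$K(\theta, n)$, such that

$$\sum_{i=1}^n \frac{\partial}{\partial\theta} \log f(x_i; \theta) = K(\theta, n)[t(x_1, \ldots, x_n) - \tau(\theta)]. \qquad (16)$$

Equation (15) is called the Cramér-Rao inequality, and the right-hand side
is called the Cramér-Rao lower bound for the variance of unbiased estimators
of $\tau(\theta)$.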


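PROOF

$$\tau'(\theta) = \frac{\partial}{\partial\theta} \tau(\theta) = \frac{\partial}{\partial\theta} \int \cdots \int t(x_1, \ldots, x_n) \prod_{i=1}^n f(x_i; \theta)\, dx_1 \cdots dx_n$$

$$= \int \cdots \int t(x_1, \ldots, x_n) \frac{\partial}{\partial\theta} \left[\prod_{i=1}^n f(x_i; \theta)\right] dx_1 \cdots dx_n - \tau(\theta) \frac{\partial}{\partial\theta} \left[\int \cdots \int \prod_{i=1}^n f(x_i; \theta)\, dx_1 \cdots dx_n\right]$$

$$= \int \cdots \int t(x_1, \ldots, x_n) \frac{\partial}{\partial\theta} \left[\prod_{i=1}^n f(x_i; \theta)\right] dx_1 \cdots dx_n - \tau(\theta) \int \cdots \int \frac{\partial}{\partial\theta} \left[\prod_{i=1}^n f(x_i; \theta)\right] dx_1 \cdots dx_n$$

$$= \int \cdots \int [t(x_1, \ldots, x_n) - \tau(\theta)] \frac{\partial}{\partial\theta} \left[\prod_{i=1}^n f(x_i; \theta)\right] dx_1 \cdots dx_n$$

$$= \int \cdots \int [t(x_1, \ldots, x_n) - \tau(\theta)] \left[\frac{\partial}{\partial\theta} \log \prod_{i=1}^n f(x_i; \theta)\right] \prod_{i=1}^n f(x_i; \theta)\, dx_1 \cdots dx_n$$

$$= \mathcal{E}_\theta\left[[t(X_1, \ldots, X_n) - \tau(\theta)] \left[\frac{\partial}{\partial\theta} \log \prod_{i=1}^n f(X_i; \theta)\right]\right].$$

The first step uses assumption (iii), and the subtracted term is zero because
the joint density integrates to 1; the second step uses assumption (ii).
Now by the Cauchy-Schwarz inequality

$$[\tau'(\theta)]^2 \le \mathcal{E}_\theta\left[[t(X_1, \ldots, X_n) - \tau(\theta)]^2\right] \mathcal{E}_\theta\left[\left[\frac{\partial}{\partial\theta} \log \prod_{i=1}^n f(X_i; \theta)\right]^2\right],$$

or

$$\operatorname{var}_\theta[T] \ge \frac{[\tau'(\theta)]^2}{\mathcal{E}_\theta\left[\left[\dfrac{\partial}{\partial\theta} \log \prod_{i=1}^n f(X_i; \theta)\right]^2\right]};$$

but

$$\mathcal{E}_\theta\left[\left[\frac{\partial}{\partial\theta} \log \prod_{i=1}^n f(X_i; \theta)\right]^2\right] = \mathcal{E}_\theta\left[\left[\sum_{i=1}^n \frac{\partial}{\partial\theta} \log f(X_i; \theta)\right]^2\right]$$

$$= \sum_i \sum_j \mathcal{E}_\theta\left[\left[\frac{\partial}{\partial\theta} \log f(X_i; \theta)\right] \left[\frac{\partial}{\partial\theta} \log f(X_j; \theta)\right]\right]$$

$$= n\, \mathcal{E}_\theta\left[\left[\frac{\partial}{\partial\theta} \log f(X; \theta)\right]^2\right],$$

using the independence of $X_i$ and $X_j$ and noting that

$$\mathcal{E}_\theta\left[\frac{\partial}{\partial\theta} \log f(X; \theta)\right] = \int \left[\frac{\partial}{\partial\theta} \log f(x; \theta)\right] f(x; \theta)\, dx = \int \frac{\partial}{\partial\theta} f(x; \theta)\, dx = \frac{\partial}{\partial\theta} \int f(x; \theta)\, dx = 0.$$

The inequality in the Cauchy-Schwarz inequality becomes an
equality if and only if one function is proportional to the other; in our case
this requires that $\dfrac{\partial}{\partial\theta} \log \prod_{i=1}^n f(x_i; \theta)$
be proportional to $t(x_1, \ldots, x_n) - \tau(\theta)$,
or that there exists $K = K(\theta, n)$ such that

$$\frac{\partial}{\partial\theta} \log \left[\prod_{i=1}^n f(x_i; \theta)\right] = K(\theta, n)[t(x_1, \ldots, x_n) - \tau(\theta)]. \quad ////$$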
The regularity conditions, which were stated for probability density
functions, can be modified for discrete density functions, leaving the statement
of the theorem unchanged.

The theorem has two uses. First, it gives a lower bound for the variance
of unbiased estimators. An experimenter using an unbiased estimator whose
variance was close to the Cramér-Rao lower bound would know that he was
using a good unbiased estimator. Second, if an unbiased estimator whose
variance coincides with the Cramér-Rao lower bound can be found, then this
estimator is an UMVUE. Equation (16) aids in finding an estimator whose
variance coincides with the Cramér-Rao lower bound. In fact, if there exists a
$T^* = t^*(X_1, \ldots, X_n)$ such that

$$\sum_{i=1}^n \frac{\partial}{\partial\theta} \log f(x_i; \theta) = K(\theta, n)[t^*(x_1, \ldots, x_n) - \tau^*(\theta)]$$

for some functions $K(\theta, n)$ and $\tau^*(\theta)$, then $T^*$ is an UMVUE
of $\tau^*(\theta)$.

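EXAMPLE 29 Let $X_1, \ldots, X_n$ be a random sample from
$f(x; \theta) = \theta e^{-\theta x} I_{(0,\infty)}(x)$. Take
$\tau(\theta) = \theta$. It can be shown that the regularity conditions are
satisfied. $\tau'(\theta) = 1$; hence

$$\operatorname{var}_\theta[T] \ge \frac{1}{n\, \mathcal{E}_\theta\left[\left[\dfrac{\partial}{\partial\theta} \log f(X; \theta)\right]^2\right]}.$$

Note that
$\dfrac{\partial}{\partial\theta} \log f(x; \theta) = \dfrac{\partial}{\partial\theta}(\log \theta - \theta x) = 1/\theta - x$,
and so

$$\mathcal{E}_\theta\left[\left[\frac{\partial}{\partial\theta} \log f(X; \theta)\right]^2\right] = \mathcal{E}_\theta\left[\left(\frac{1}{\theta} - X\right)^2\right] = \operatorname{var}[X] = \frac{1}{\theta^2}.$$

Hence, the Cramér-Rao lower bound for the variance of unbiased
estimators of $\theta$ is given by

$$\operatorname{var}_\theta[T] \ge \frac{1}{n(1/\theta^2)} = \frac{\theta^2}{n}.$$

Similarly, the Cramér-Rao lower bound for the variance of unbiased
estimators of $\tau(\theta) = 1/\theta$ is given by

$$\operatorname{var}_\theta[T] \ge \frac{(-1/\theta^2)^2}{n/\theta^2} = \frac{1}{n\theta^2}.$$

The left-hand side of Eq. (16) is

$$\sum_{i=1}^n \frac{\partial}{\partial\theta}(\log \theta - \theta x_i) = \sum_{i=1}^n \left(\frac{1}{\theta} - x_i\right) = -n\left(\bar{x}_n - \frac{1}{\theta}\right).$$

By taking $K(\theta, n) = -n$ and utilizing the result of Eq. (16), we see that
$\bar{X}_n$ is an UMVUE of $1/\theta$ since its variance coincides with the
Cramér-Rao lower bound. ////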
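The conclusion of Example 29 can be checked by simulation. The following is a minimal Python sketch, not part of the original text; the function name, the parameter values, the number of replications, and the seed are all assumptions made for illustration. It estimates the mean and variance of the sample mean of exponential observations and compares the variance with the Cramér-Rao lower bound for estimators of the reciprocal of the parameter.

```python
import random
from statistics import mean, variance

def simulate_sample_means(theta, n, reps, seed=0):
    """Draw `reps` samples of size n from theta*exp(-theta*x) and return their means."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(theta) for _ in range(n)) / n for _ in range(reps)]

theta, n = 2.0, 5                      # illustrative parameter value and sample size
means = simulate_sample_means(theta, n, reps=20000)

bound = 1.0 / (n * theta**2)           # Cramér-Rao lower bound for tau(theta) = 1/theta
estimated_mean = mean(means)           # should be close to 1/theta
estimated_var = variance(means)        # should be close to the bound
```

Because this is a Monte Carlo estimate, agreement with the bound is only approximate and improves as the number of replications grows.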


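EXAMPLE 30 Let $X_1, \ldots, X_n$ be a random sample from
$f(x; \theta) = f(x; \lambda) = e^{-\lambda} \lambda^x / x!$ for
$x = 0, 1, 2, \ldots$.

$$\frac{\partial}{\partial\lambda} \log f(x; \lambda) = \frac{\partial}{\partial\lambda} \log \frac{e^{-\lambda} \lambda^x}{x!} = \frac{\partial}{\partial\lambda}(-\lambda + x \log \lambda - \log x!) = -1 + \frac{x}{\lambda}.$$

Therefore

$$\mathcal{E}\left[\left[\frac{\partial}{\partial\lambda} \log f(X; \lambda)\right]^2\right] = \mathcal{E}\left[\left(\frac{X}{\lambda} - 1\right)^2\right] = \frac{1}{\lambda^2}\, \mathcal{E}[(X - \lambda)^2] = \frac{1}{\lambda^2} \operatorname{var}[X] = \frac{1}{\lambda^2}\, \lambda = \frac{1}{\lambda},$$

and so the denominator of the Cramér-Rao lower bound is $n/\lambda$. Now,
if $\tau(\lambda) = e^{-\lambda} = P[X = 0]$, then the Cramér-Rao lower bound
for the variance of unbiased estimators of $\tau(\lambda) = e^{-\lambda}$ is
given by $\operatorname{var}[T] \ge \lambda e^{-2\lambda}/n$. Note that
$T = (1/n) \sum_{i=1}^n I_{\{0\}}(X_i)$ is an unbiased estimator of
$\tau(\lambda) = e^{-\lambda}$ since
$\mathcal{E}[T] = (1/n) \sum_{i=1}^n \mathcal{E}[I_{\{0\}}(X_i)] = (1/n) \sum_{i=1}^n e^{-\lambda} = e^{-\lambda}$.
$I_{\{0\}}(X_i) = 1$ if $X_i = 0$, and $I_{\{0\}}(X_i) = 0$ otherwise; so
$\mathcal{E}[I_{\{0\}}(X_i)] = 1 \cdot P[X_i = 0] + 0 \cdot P[X_i \ne 0] = e^{-\lambda}$.
$T$ is the proportion of observations in the sample that are equal to 0.
$\operatorname{var}[T] = (1/n) e^{-\lambda}(1 - e^{-\lambda})$, as compared to
the Cramér-Rao lower bound, which is $(1/n) \lambda e^{-2\lambda}$. Note that

$$(1/n)\, e^{-\lambda}(1 - e^{-\lambda}) \ge (1/n)\, \lambda e^{-2\lambda},$$

as it should be. An UMVUE of $\tau(\lambda) = e^{-\lambda}$ is found in
Example 34.

We note that
$\sum (\partial/\partial\lambda) \log f(x_i; \lambda) = \sum (-1 + x_i/\lambda) = (n/\lambda)(\bar{x}_n - \lambda)$;
hence, $\bar{X}_n$ is the UMVUE of $\lambda$ by Eq. (16). ////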
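The inequality between the variance of the proportion of zeros and the Cramér-Rao lower bound in Example 30 can be tabulated directly from the two closed-form expressions. The following is a minimal Python sketch, not part of the original text; the function names and the grid of parameter values are hypothetical.

```python
from math import exp

def var_proportion_zeros(lam, n):
    """Variance of T, the sample proportion of zeros: e^(-lam)(1 - e^(-lam))/n."""
    return exp(-lam) * (1 - exp(-lam)) / n

def cramer_rao_bound(lam, n):
    """Cramér-Rao lower bound for unbiased estimators of e^(-lam): lam e^(-2 lam)/n."""
    return lam * exp(-2 * lam) / n

n = 10
grid = [0.1, 0.5, 1.0, 2.0, 5.0]  # illustrative values of lam

# Ratio bound / variance; it never exceeds 1, and equals lam / (e^lam - 1).
efficiency = [cramer_rao_bound(lam, n) / var_proportion_zeros(lam, n) for lam in grid]
```

The ratio simplifies algebraically to lam / (e^lam - 1), which is below 1 for every positive lam, so T never attains the bound.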
In general, the Cramér-Rao lower bound is not an attainable lower bound;
that is, there often exists a lower bound for variance that is greater than the
Cramér-Rao lower bound. We will see several such examples in Subsec. 5.2
below. We will see that an UMVUE can exist whose variance does not coincide
with the Cramér-Rao lower bound.

We conclude this subsection with several remarks, the statements of which
are not necessarily mathematically precise. All the same, the remarks are
important and do relate some earlier concepts to the Cramér-Rao lower bound.

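Remark Under certain assumptions involving the existence of second
derivatives and the validity of interchanging the order of certain
differentiations and integrations,

$$\mathcal{E}_\theta\left[\left[\frac{\partial}{\partial\theta} \log f(X; \theta)\right]^2\right] = -\mathcal{E}_\theta\left[\frac{\partial^2}{\partial\theta^2} \log f(X; \theta)\right]. \quad ////$$

This remark is computationally useful if the first expectation is more
difficult to obtain than the second. The proof is left as an exercise.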

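Remark If the maximum-likelihood estimate of $\theta$, say
$\hat{\theta} = \hat{\vartheta}(x_1, \ldots, x_n)$, is given by a solution to
the equation

$$\frac{\partial}{\partial\theta} \log L(\theta; x_1, \ldots, x_n) \equiv \frac{\partial}{\partial\theta} \log \prod_{i=1}^n f(x_i; \theta) = 0,$$

and if $T^* = t^*(X_1, \ldots, X_n)$ is an unbiased estimator of
$\tau^*(\theta)$ whose variance coincides with the Cramér-Rao lower bound, then
$t^*(x_1, \ldots, x_n) = \tau^*(\hat{\vartheta}(x_1, \ldots, x_n))$.

PROOF

$$0 = \left.\frac{\partial}{\partial\theta} \log \prod_{i=1}^n f(x_i; \theta)\right|_{\theta = \hat{\theta}} = \left. K(\theta, n)[t^*(x_1, \ldots, x_n) - \tau^*(\theta)]\right|_{\theta = \hat{\theta}}$$

by Eq. (16) and the definition of $\hat{\theta}$. ////

This remark tells us that under the conditions of the remark a
maximum-likelihood estimator is an UMVUE!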

Remark If $T^* = t^*(X_1, \ldots, X_n)$ is an unbiased estimator of some
$\tau^*(\theta)$ whose variance coincides with the Cramér-Rao lower bound, then
$f(\cdot; \theta)$ is a member of the exponential class; and, conversely, if
$f(\cdot; \theta)$ is a member of the exponential class, then there exists an
unbiased estimator, say $T^*$, of some function, say $\tau^*(\theta)$, whose
variance coincides with the Cramér-Rao lower bound. ////
We will omit the proof of this remark. It relates the Cramér-Rao lower
bound to the exponential family; in fact, it tells us that we will be able to find
an estimator whose variance coincides with the Cramér-Rao lower bound if
and only if the density from which we are sampling is a member of the
exponential class. Although the remark does not explicitly so state, the following is
true: There is essentially only one function (one function and then any linear
function of the one function) of the parameter for which there exists an unbiased
estimator whose variance coincides with the Cramér-Rao lower bound. So,
what this remark and the comments following it really tell is: The
Cramér-Rao lower bound is of limited use in finding UMVUEs. It is useful only if we
sample from a member of the one-parameter exponential family, and even then
it is useful in finding the UMVUE of only one function of the parameter.
Hence, it behooves us to search for other techniques for finding UMVUEs, and
that is what we do in the next subsection.

5.2 Sufficiency and Completeness

In this subsection we will continue our search for UMVUEs. Our first result 
will show how sufficiency aids in this search. Loosely speaking, an unbiased 
estimator which is a function of sufficient statistics has smaller variance than an 
unbiased estimator which is not based on sufficient statistics. In fact, let 
$f(\cdot; \theta)$ be the density from which we can sample, and suppose that we
want to estimate $\tau(\theta)$. Let us assume that
$T = t(X_1, \ldots, X_n)$ is an unbiased estimator of $\tau(\theta)$ and that
$S = s(X_1, \ldots, X_n)$ is a sufficient statistic. It can be shown that
another unbiased estimator, denoted by $T'$, can be derived from $T$ such that
(i) $T'$ is a function of the sufficient statistic $S$ and (ii) $T'$ is an
unbiased estimator of $\tau(\theta)$ with variance less than or equal to the
variance of $T$. Therefore, in our
search for UMVUEs we need to consider only unbiased estimators that are 
functions of sufficient statistics. We shall formalize these ideas in the following 
theorem. 

Theorem 8 Rao-Blackwell Let $X_1, \ldots, X_n$ be a random sample from
the density $f(\cdot; \theta)$, and let
$S_1 = s_1(X_1, \ldots, X_n), \ldots, S_k = s_k(X_1, \ldots, X_n)$
be a set of jointly sufficient statistics. Let the statistic
$T = t(X_1, \ldots, X_n)$ be an unbiased estimator of $\tau(\theta)$. Define
$T'$ by $T' = \mathcal{E}[T \mid S_1, \ldots, S_k]$. Then,

(i) $T'$ is a statistic, and it is a function of the sufficient statistics
$S_1, \ldots, S_k$. Write $T' = t'(S_1, \ldots, S_k)$.

(ii) $\mathcal{E}_\theta[T'] = \tau(\theta)$; that is, $T'$ is an unbiased
estimator of $\tau(\theta)$.

(iii) $\operatorname{var}_\theta[T'] \le \operatorname{var}_\theta[T]$ for
every $\theta$, and $\operatorname{var}_\theta[T'] < \operatorname{var}_\theta[T]$
for some $\theta$ unless $T$ is equal to $T'$ with probability 1.
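PROOF (i) $S_1, \ldots, S_k$ are sufficient statistics, so the conditional
distribution of any statistic, in particular the statistic $T$, given
$S_1, \ldots, S_k$ is independent of $\theta$; hence
$T' = \mathcal{E}[T \mid S_1, \ldots, S_k]$ is independent of $\theta$, and
so $T'$ is a statistic which is obviously a function of $S_1, \ldots, S_k$.
(ii) $\mathcal{E}_\theta[T'] = \mathcal{E}_\theta[\mathcal{E}[T \mid S_1, \ldots, S_k]] = \mathcal{E}_\theta[T] = \tau(\theta)$
[using Eq. (26) of Chap. IV]. (iii) We can write

$$\operatorname{var}_\theta[T] = \mathcal{E}_\theta[(T - \mathcal{E}_\theta[T'])^2] = \mathcal{E}_\theta[(T - T' + T' - \mathcal{E}_\theta[T'])^2]$$

$$= \mathcal{E}_\theta[(T - T')^2] + 2\,\mathcal{E}_\theta[(T - T')(T' - \mathcal{E}_\theta[T'])] + \operatorname{var}_\theta[T'].$$

But

$$\mathcal{E}_\theta[(T - T')(T' - \mathcal{E}_\theta[T'])] = \mathcal{E}_\theta[\mathcal{E}[(T - T')(T' - \mathcal{E}_\theta[T']) \mid S_1, \ldots, S_k]]$$

and

$$\mathcal{E}[(T - T')(T' - \mathcal{E}_\theta[T']) \mid S_1 = s_1, \ldots, S_k = s_k]$$

$$= \{t'(s_1, \ldots, s_k) - \mathcal{E}_\theta[T']\}\, \mathcal{E}[(T - T') \mid S_1 = s_1, \ldots, S_k = s_k]$$

$$= \{t'(s_1, \ldots, s_k) - \mathcal{E}_\theta[T']\} \{\mathcal{E}[T \mid S_1 = s_1, \ldots, S_k = s_k] - \mathcal{E}[T' \mid S_1 = s_1, \ldots, S_k = s_k]\}$$

$$= \{t'(s_1, \ldots, s_k) - \mathcal{E}_\theta[T']\} \{t'(s_1, \ldots, s_k) - t'(s_1, \ldots, s_k)\}$$

$$= 0,$$

and therefore

$$\operatorname{var}_\theta[T] = \mathcal{E}_\theta[(T - T')^2] + \operatorname{var}_\theta[T'] \ge \operatorname{var}_\theta[T'].$$

Note that $\operatorname{var}_\theta[T] > \operatorname{var}_\theta[T']$
unless $T$ equals $T'$ with probability 1. ////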

For many applications (particularly where the density involved has only
one unknown parameter) there will exist a single sufficient statistic, say
$S = s(X_1, \ldots, X_n)$, which would then be used in place of the jointly
sufficient set of statistics $S_1, \ldots, S_k$. What the theorem says is that,
given an unbiased estimator, another unbiased estimator that is a function of
sufficient statistics can be derived, and it will not have larger variance. To
find the derived statistic, the calculation of a conditional expectation, which
may or may not be easy, is required.


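EXAMPLE 31 Let $X_1, \ldots, X_n$ be a random sample from the Bernoulli density
$f(x; \theta) = \theta^x (1 - \theta)^{1-x}$ for $x = 0$ or 1. $X_1$ is an
unbiased estimator of $\tau(\theta) = \theta$. We use $X_1$ as
$T = t(X_1, \ldots, X_n)$ in the above theorem. $\sum X_i$ is a sufficient
statistic, so we use $S = \sum X_i$ as our set (of one element) of
sufficient statistics. According to the above theorem,
$T' = \mathcal{E}[T \mid S] = \mathcal{E}[X_1 \mid \sum X_i]$ is an unbiased
estimator of $\theta$ with no larger variance than $T = X_1$. Let us evaluate
$\mathcal{E}[X_1 \mid \sum X_i]$. We first find the conditional distribution
of $X_1$ given $\sum X_i = s$. $X_1$ takes on at most the two values 0 and 1.

$$P\left[X_1 = 0 \,\Big|\, \sum_{i=1}^n X_i = s\right] = \frac{P\left[X_1 = 0;\ \sum_{i=2}^n X_i = s\right]}{P\left[\sum_{i=1}^n X_i = s\right]} = \frac{P[X_1 = 0] \cdot P\left[\sum_{i=2}^n X_i = s\right]}{P\left[\sum_{i=1}^n X_i = s\right]}$$

$$= \frac{(1 - \theta) \binom{n-1}{s} \theta^s (1 - \theta)^{n-1-s}}{\binom{n}{s} \theta^s (1 - \theta)^{n-s}} = \frac{\binom{n-1}{s}}{\binom{n}{s}} = \frac{n - s}{n},$$

$$P\left[X_1 = 1 \,\Big|\, \sum_{i=1}^n X_i = s\right] = \frac{P\left[X_1 = 1;\ \sum_{i=2}^n X_i = s - 1\right]}{P\left[\sum_{i=1}^n X_i = s\right]} = \frac{P[X_1 = 1] \cdot P\left[\sum_{i=2}^n X_i = s - 1\right]}{P\left[\sum_{i=1}^n X_i = s\right]}$$

$$= \frac{\theta \binom{n-1}{s-1} \theta^{s-1} (1 - \theta)^{n-1-s+1}}{\binom{n}{s} \theta^s (1 - \theta)^{n-s}} = \frac{\binom{n-1}{s-1}}{\binom{n}{s}} = \frac{s}{n}.$$

We note in passing that the conditional distribution of $X_1$ given
$\sum X_i = s$ is independent of $\theta$, as it should be. Also, we could
have derived the conditional distribution with much less effort by asking:
Given that you have observed $s$ successes in $n$ trials, what is the
probability that the first trial resulted in a success? This probability is
$s/n$. (See Example 28 in Chap. I.) Hence,

$$T' = \frac{\sum_{i=1}^n X_i}{n}.$$

The variance of $X_1$ is $\theta(1 - \theta)$, and the variance of $T'$ is
$\theta(1 - \theta)/n$; so for $n > 1$ the variance of $T'$ is actually smaller
than the variance of $T = X_1$. ////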
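The conditional expectation computed in Example 31 can be confirmed by brute-force enumeration. The following is a minimal Python sketch, not part of the original text; the function name is hypothetical, and exact rational arithmetic is used so that the comparison with s/n involves no rounding.

```python
from fractions import Fraction
from itertools import product

def cond_expectation_x1(n, s, theta):
    """E[X_1 | sum X_i = s] for a Bernoulli(theta) sample, by enumerating outcomes."""
    num = Fraction(0)
    den = Fraction(0)
    for xs in product((0, 1), repeat=n):
        if sum(xs) != s:
            continue
        # Every outcome with sum s has probability theta^s (1 - theta)^(n - s).
        p = theta**s * (1 - theta) ** (n - s)
        num += xs[0] * p
        den += p
    return num / den

n = 4
# Two different parameter values give the same conditional expectation.
results_a = [cond_expectation_x1(n, s, Fraction(1, 3)) for s in range(n + 1)]
results_b = [cond_expectation_x1(n, s, Fraction(3, 4)) for s in range(n + 1)]
```

The two lists agree entry by entry, which reflects the fact that the conditional distribution given a sufficient statistic does not involve the parameter.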


Before leaving Theorem 8, two comments are appropriate. First, if the
unbiased estimator $T$ is already a function of only $S_1, \ldots, S_k$, then
the derived statistic $T'$ will be identical to $T$, and hence no improvement
in variance can be expected. Second, although the set of jointly sufficient
statistics is an arbitrary set, in practice one would naturally use a minimal
set of jointly sufficient statistics if such were available.

Theorem 8 tells us how to improve on an unbiased estimator by
conditioning on sufficient statistics; for some estimation problems this unbiased
estimator, obtained by conditioning on sufficient statistics, will be an UMVUE.
To aid in identifying those estimation problems for which a derived estimator
is an UMVUE, the concept of completeness of a family of densities is useful.

Definition 22 Complete family of densities Let $X_1, \ldots, X_n$ denote a
random sample from the density $f(\cdot; \theta)$ with parameter space
$\Theta$, and let $T = t(X_1, \ldots, X_n)$ be a statistic. The family of
densities of $T$ is defined to be complete if and only if
$\mathcal{E}_\theta[z(T)] \equiv 0$ for all $\theta \in \Theta$ implies that
$P_\theta[z(T) = 0] \equiv 1$ for all $\theta \in \Theta$, where $z(T)$ is a
statistic. Also, the statistic $T$ is said to be complete if and only if its
family of densities is complete. ////

Another way of stating that a statistic $T$ is complete is the following:
$T$ is complete if and only if the only unbiased estimator of 0 that is a
function of $T$ is the statistic that is identically 0 with probability 1.


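EXAMPLE 32 Let $X_1, \ldots, X_n$ be a random sample from the Bernoulli
density. The statistic $T = X_1 - X_2$ is not complete since
$\mathcal{E}_\theta[X_1 - X_2] \equiv 0$ and $X_1 - X_2$ is not 0 with
probability 1. Consider the statistic $T = \sum X_i$. Let $z(T)$ be any
statistic that is a function of $T$ for which
$\mathcal{E}_\theta[z(T)] \equiv 0$ for all $\theta \in \Theta$, that is, for
$0 \le \theta \le 1$. To argue that $T$ is complete, we must show that
$z(t) = 0$ for $t = 0, 1, \ldots, n$. Now

$$\mathcal{E}_\theta[z(T)] = \sum_{t=0}^n z(t) \binom{n}{t} \theta^t (1 - \theta)^{n-t} = (1 - \theta)^n \sum_{t=0}^n z(t) \binom{n}{t} [\theta/(1 - \theta)]^t;$$

hence, $\mathcal{E}_\theta[z(T)] \equiv 0$ for all $0 < \theta < 1$ implies that

$$\sum_{t=0}^n z(t) \binom{n}{t} [\theta/(1 - \theta)]^t \equiv 0 \quad \text{for all } 0 < \theta < 1,$$

or

$$\sum_{t=0}^n z(t) \binom{n}{t} \alpha^t \equiv 0$$

for all $\alpha > 0$, where $\alpha = \theta/(1 - \theta)$. Now in order for a
polynomial in $\alpha$ to be identically 0, each coefficient of $\alpha^t$,
$t = 0, \ldots, n$, must be 0; that is, $z(t)\binom{n}{t} = 0$ for
$t = 0, \ldots, n$, but $\binom{n}{t} \ne 0$; so $z(t) = 0$ for
$t = 0, \ldots, n$. ////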


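EXAMPLE 33 Let $X_1, \ldots, X_n$ be a random sample from the uniform
distribution over the interval $(0, \theta)$, where
$\Theta = \{\theta: \theta > 0\}$. Show that the statistic $Y_n$ is complete.
We must show that if $\mathcal{E}_\theta[z(Y_n)] \equiv 0$ for all
$\theta > 0$, then $P_\theta[z(Y_n) = 0] \equiv 1$ for all $\theta > 0$.

$$\mathcal{E}_\theta[z(Y_n)] = \int z(y) f_{Y_n}(y)\, dy = \int_0^\theta z(y)\, n \left(\frac{1}{\theta}\right)^n y^{n-1}\, dy,$$

and $\mathcal{E}_\theta[z(Y_n)] \equiv 0$ for all $\theta > 0$ when and only when

$$\frac{n}{\theta^n} \int_0^\theta z(y)\, y^{n-1}\, dy \equiv 0 \quad \text{for all } \theta > 0$$

or

$$\int_0^\theta z(y)\, y^{n-1}\, dy \equiv 0 \quad \text{for all } \theta > 0.$$

Differentiating both sides of this last identity with respect to $\theta$
produces $z(\theta)\, \theta^{n-1} \equiv 0$, which in turn implies that
$z(\theta) = 0$ for $\theta > 0$. ////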

In general, demonstrating completeness can require tricky analysis. The 
two above examples are exceptions. We state now, without proof, a theorem 
that gives us completeness of a statistic. It will be our main tool for arguing 
completeness. 
Theorem 9 Let $X_1, \ldots, X_n$ be a random sample from $f(\cdot; \theta)$,
$\theta \in \Theta$, where $\Theta$ is an interval (possibly infinite). If
$f(x; \theta) = a(\theta)\, b(x) \exp[c(\theta)\, d(x)]$, that is,
$f(\cdot; \theta)$ is a member of the one-parameter exponential family,
then $\sum d(X_i)$ is a complete minimal sufficient statistic. ////

Theorem 9 shows once again the importance of the exponential family or
exponential class. We are finally adequately prepared to state the theorem that
is useful in finding UMVUEs.

Theorem 10 Lehmann-Scheffé Let $X_1, \ldots, X_n$ be a random sample
from a density $f(\cdot; \theta)$. If $S = s(X_1, \ldots, X_n)$ is a complete
sufficient statistic and if $T^* = t^*(S)$, a function of $S$, is an unbiased
estimator of $\tau(\theta)$, then $T^*$ is an UMVUE of $\tau(\theta)$.

PROOF Let $T'$ be any unbiased estimator of $\tau(\theta)$ which is a function
of $S$; that is, $T' = t'(S)$. Then $\mathcal{E}_\theta[T^* - T'] \equiv 0$ for
all $\theta \in \Theta$, and $T^* - T'$ is a function of $S$; so by
completeness of $S$, $P_\theta[t^*(S) = t'(S)] \equiv 1$ for all
$\theta \in \Theta$. Hence there is only one unbiased estimator of
$\tau(\theta)$ that is a function of $S$. Now let $T$ be any unbiased estimator
of $\tau(\theta)$. $T^*$ must be equal to $\mathcal{E}[T \mid S]$ since
$\mathcal{E}[T \mid S]$ is an unbiased estimator of $\tau(\theta)$ depending on
$S$. By Theorem 8,
$\operatorname{var}_\theta[T^*] \le \operatorname{var}_\theta[T]$ for all
$\theta \in \Theta$; so $T^*$ is an UMVUE. ////

Let us review what this important theorem says. First, if a complete
sufficient statistic $S$ exists and if there is an unbiased estimator for
$\tau(\theta)$, then there is an UMVUE for $\tau(\theta)$; second, the UMVUE is
the unique unbiased estimator of $\tau(\theta)$ which is a function of $S$.

To actually find that unbiased estimator of $\tau(\theta)$ which is a function
of $S$, we have several ways of proceeding. First, simply guess the correct form
of the function of $S$ that defines the desired estimator. Second, guess or find
any unbiased estimator of $\tau(\theta)$, and then calculate the conditional
expectation of the unbiased estimator given the sufficient statistic. Third,
solve for $t^*(\cdot)$ in the equation
$\mathcal{E}_\theta[t^*(S)] = \tau(\theta)$. Such an equation becomes the
integral equation $\int t^*(s) f_S(s)\, ds \equiv \tau(\theta)$ in the case of a
continuous random variable $S$ and becomes the summation
$\sum t^*(s) f_S(s) \equiv \tau(\theta)$ for $S$ a discrete random variable.
We will employ two of these methods in the following examples.

EXAMPLE 34 Let $X_1, \ldots, X_n$ be a random sample from the Poisson density

$$f(x; \lambda) = \frac{e^{-\lambda} \lambda^x}{x!} \quad \text{for } x = 0, 1, \ldots.$$

We saw in Example 25 that $f(x; \lambda)$ belongs to the exponential family with
$d(x) = x$. By Theorem 9, the statistic $\sum X_i$ is complete and sufficient.
To find the UMVUE of $\lambda$ itself, it suffices to guess a function of
$\sum X_i$ whose expectation is $\lambda$. Noting that $\lambda$ is the
population mean, $(1/n) \sum X_i$ is the obvious choice; so $(1/n) \sum X_i$ is
the UMVUE of $\lambda$.

Consider now estimating $\tau(\lambda) = e^{-\lambda} = P[X_i = 0]$. (Recall
Example 30.) Let us derive the UMVUE of $e^{-\lambda}$ by calculating the
conditional expectation of some unbiased estimator given the sufficient statistic.
Any unbiased estimator will do as the preliminary estimator whose
conditional expectation needs to be calculated; so we may as well choose
one that would make the calculations easy. $I_{\{0\}}(X_1)$ is an unbiased
estimator of $e^{-\lambda}$ and is relatively simple since it can assume only
the two values 0 and 1. By Theorem 10,
$\mathcal{E}[I_{\{0\}}(X_1) \mid \sum X_i]$ is the UMVUE of $e^{-\lambda}$. To
find the desired conditional expectation, we first find the conditional
distribution of $X_1$ given $\sum X_i$.

$$P\left[X_1 = 0 \,\Big|\, \sum_{i=1}^n X_i = s\right] = \frac{P\left[X_1 = 0;\ \sum_{i=1}^n X_i = s\right]}{P\left[\sum_{i=1}^n X_i = s\right]} = \frac{P[X_1 = 0]\, P\left[\sum_{i=2}^n X_i = s\right]}{P\left[\sum_{i=1}^n X_i = s\right]}$$

$$= \frac{e^{-\lambda}\, e^{-(n-1)\lambda} [(n - 1)\lambda]^s / s!}{e^{-n\lambda} (n\lambda)^s / s!} = \left(\frac{n - 1}{n}\right)^s.$$

Therefore,

$$\mathcal{E}\left[I_{\{0\}}(X_1) \,\Big|\, \sum X_i = s\right] = P\left[X_1 = 0 \,\Big|\, \sum X_i = s\right] = \left(\frac{n - 1}{n}\right)^s \quad \text{for } n > 1;$$

hence

$$\left(\frac{n - 1}{n}\right)^{\sum X_i}$$

is the UMVUE of $e^{-\lambda}$ for $n > 1$. For $n = 1$, $I_{\{0\}}(X_1)$ is an
unbiased estimator which is a function of the complete sufficient statistic
$X_1$, and hence $I_{\{0\}}(X_1)$ itself is the UMVUE of $e^{-\lambda}$. The
reader may want to derive the mean and variance of

$$\left(\frac{n - 1}{n}\right)^{\sum X_i}$$

and compare them with the mean and variance of the estimator $T$
given in Example 30. ////
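The unbiasedness of the estimator derived in Example 34 can be checked by summing its values against the Poisson density of the sufficient statistic, which has mean n times lambda. The following is a minimal Python sketch, not part of the original text; the function name, the truncation point of the series, and the parameter values are assumptions made for illustration.

```python
from math import exp

def expected_umvue(lam, n, terms=200):
    """Approximate E[((n-1)/n)^S] for S ~ Poisson(n*lam) by a truncated series."""
    mean = n * lam
    pmf = exp(-mean)          # P[S = 0]
    total = 0.0
    for s in range(terms):
        total += ((n - 1) / n) ** s * pmf
        pmf *= mean / (s + 1)  # recursion: P[S = s+1] = P[S = s] * mean / (s+1)
    return total

lam, n = 1.5, 4               # illustrative parameter value and sample size
approx = expected_umvue(lam, n)
target = exp(-lam)            # the quantity being estimated, e^(-lam)
```

The truncation at 200 terms is ample for a Poisson mean of 6; a much larger mean would need more terms.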


EXAMPLE 35 Let A',, , X„ be a random sample from /(x, 0)«= 

0e~**I (O „,(*) Our object is to find the UMVUE of each of the following 
functions of the parameter 0 0, 1/0, and e~ K> =* P[X>K] for given K 
Since 0e~ ex I( O *)(*) 1S a member of the exponential class {see Example 24), 

the statistic S =* X) X t is complete and sufficient 

X n = (1/n) Y, X t , which is a function of the complete sufficient 

statistics = £ X{,is an unbiased estimator ofl/0, hence by Theorem 10, 
X, is the UMVUE of I/O 

To find the UMVUE of 0 , one might suspect that the estimator is 
of the form c/£ X t , where c is a constant which may depend on n Now 


'•fey ■ •*•[!]■ 


=,c rkf? v 


r(n)J( 


r(«) J 0 


/; 


for n > I So StlcfZ X/l = 0 when c = rt - 1 , hence (n - l)/£ Xt is the 
UMVUE of D for n > 1 The variance of { n — 1 )/£ X ( is given by 

— 2) for n> 2 

Although one might be able to guess which function of $S = \sum X_i$
is an unbiased estimator for $e^{-K\theta}$, let us derive the desired estimator by
starting with the following simple unbiased estimator of $e^{-K\theta}$: $I_{(K,\infty)}(X_1)$.
Note that $\mathcal{E}_\theta[I_{(K,\infty)}(X_1)] = 0 \cdot P[X_1 \le K] + 1 \cdot P[X_1 > K] = P[X_1 > K] = e^{-K\theta}$,
so $I_{(K,\infty)}(X_1)$ is indeed an unbiased estimator of $e^{-K\theta}$, and therefore
by Theorems 8 and 10, $\mathcal{E}[I_{(K,\infty)}(X_1) \mid S]$ is the UMVUE of $e^{-K\theta}$. Now,
$\mathcal{E}[I_{(K,\infty)}(X_1) \mid S = s] = P[I_{(K,\infty)}(X_1) = 1 \mid S = s] = P[X_1 > K \mid S = s]$. In
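order to obtain $P[X_1 > K \mid S = s]$, we will first find the conditional distribution
of $X_1$ given $S = s$. Carrying out this calculation gives
$P[X_1 > K \mid S = s] = [(s-K)/s]^{n-1}$ for $s > K$ (and 0 otherwise), so that

\[
\left(\frac{S-K}{S}\right)^{n-1} I_{(K,\infty)}(S)
\]

is the UMVUE for $e^{-K\theta}$ for $n > 1$. (Actually the estimator is applicable
for $n = 1$ as well.) It may be of interest and would serve as a check to
verify directly that

\[
\left(\frac{S-K}{S}\right)^{n-1} I_{(K,\infty)}(S)
\]

is unbiased:

\[
\begin{aligned}
\mathcal{E}_\theta\!\left[\left(\frac{S-K}{S}\right)^{n-1} I_{(K,\infty)}(S)\right]
&= \int_K^\infty \left(\frac{s-K}{s}\right)^{n-1} \frac{1}{\Gamma(n)}\,\theta^{n} s^{n-1} e^{-\theta s}\, ds \\
&= e^{-K\theta} \int_0^\infty \frac{1}{\Gamma(n)}\,\theta^{n} u^{n-1} e^{-\theta u}\, du = e^{-K\theta},
\end{aligned}
\]

where the substitution $u = s - K$ was made. ////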
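
The direct verification suggested at the end of Example 35 can also be sketched numerically. The code below assumes that $S = \sum X_i$ has a gamma density with parameters $n$ and $\theta$, and approximates $\mathcal{E}_\theta[((S-K)/S)^{n-1} I_{(K,\infty)}(S)]$ with a midpoint rule. The function names, the cutoff `upper`, and the step count are illustrative assumptions.

```python
import math

def gamma_pdf(s, n, theta):
    # density of S = X_1 + ... + X_n when each X_i has density theta * exp(-theta * x)
    return math.exp(n * math.log(theta) + (n - 1) * math.log(s)
                    - theta * s - math.lgamma(n))

def mean_of_estimator(n, theta, K, upper=None, steps=200_000):
    """Midpoint-rule approximation of E[((S-K)/S)^(n-1) I(S > K)]."""
    if upper is None:
        upper = K + 60.0 / theta  # the integrand is negligible beyond this point
    h = (upper - K) / steps
    total = 0.0
    for j in range(steps):
        s = K + (j + 0.5) * h
        total += ((s - K) / s) ** (n - 1) * gamma_pdf(s, n, theta)
    return total * h

# Compare with the target exp(-K * theta).
print(mean_of_estimator(4, 2.0, 0.5), math.exp(-0.5 * 2.0))
```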


In closing this section on unbiased estimation, we make several remarks. 

Remark For some functions of the parameter there is no unbiased
estimator. For example, in a sample of size 1 from a binomial density
there is no unbiased estimator for $1/\theta$. Suppose there were; let $T = t(X)$
denote it. Then

\[
\mathcal{E}_\theta[T] = \sum_{x=0}^{n} t(x) \binom{n}{x} \theta^{x} (1-\theta)^{n-x} \equiv \frac{1}{\theta},
\]

which says that an $n$th-degree polynomial in $\theta$ is identical to $1/\theta$, which cannot be. ////

Remark We mentioned in Subsec. 5.1 that the Cramér-Rao lower bound
is not necessarily the best lower bound. For example, the Cramér-Rao
lower bound for the variance of unbiased estimators of $\theta$ in sampling
from the negative exponential distribution is given by $\theta^2/n$ (see Example 29),
and the variance of the UMVUE of $\theta$ is given by $\theta^2/(n-2)$ (see Example 35).
$\theta^2/(n-2)$ is necessarily the best lower bound. ////

Remark For some estimation problems there is an unbiased estimator
but no UMVUE. Consider the following example. ////


EXAMPLE 36 Let $X_1, \ldots, X_n$ be a random sample from the uniform density
over the interval $(\theta, \theta + 1]$. We want to estimate $\theta$. $\bar{X}_n - \tfrac12$ and
$(Y_1 + Y_n)/2 - \tfrac12$ are unbiased estimators of $\theta$; yet there is no UMVUE
of $\theta$. For fixed $0 < p < 1$, consider the estimator $g(X_1 - p) + p$, where
the function $g(y)$ is defined to be the greatest integer less than $y$. Now

\[
\mathcal{E}_\theta[g(X_1 - p) + p] = \int_\theta^{\theta+1} g(x_1 - p)\, dx_1 + p = \int_{\theta-p}^{\theta+1-p} g(y)\, dy + p.
\]

For fixed $\theta$ and $p$, there exists an integer, say $N = N(\theta, p)$, satisfying
$\theta - p < N \le \theta + 1 - p$. Hence,

\[
\mathcal{E}_\theta[g(X_1 - p) + p] = \int_{\theta-p}^{\theta+1-p} g(y)\, dy + p
= \int_{\theta-p}^{N} (N-1)\, dy + \int_{N}^{\theta+1-p} N\, dy + p = \theta.
\]

So $g(X_1 - p) + p$ is an unbiased estimator of $\theta$. Moreover, if $\theta + 1 - p$
is an integer, say $J$, then $g(x_1 - p) = J - 1$ for all $x_1$ satisfying
$\theta - p < x_1 - p \le \theta + 1 - p$; so $g(x_1 - p) + p = J - 1 + p = \theta + 1 - p - 1 + p = \theta$
for all $\theta < x_1 \le \theta + 1$; that is, $g(X_1 - p) + p$ estimates $\theta$ with
no error, and hence has zero variance for $\theta + 1 - p$ equal to any integer.
So, we have an estimator, namely $g(X_1 - p) + p$, which has zero variance
for $\theta$ equal to any integer $- 1 + p$. But $0 < p < 1$ is arbitrary; so for any fixed
$\theta$, say $\theta_0$, we can find an unbiased estimator of $\theta$ which has zero variance
at $\theta_0$. Hence, in order for an estimator to be the UMVUE of $\theta$, it must
have zero variance for all $\theta$; that is, it must always estimate $\theta$ without
error. Clearly, no such estimator exists. {The reader may wish to show
that $\operatorname{var}[g(X_1 - p) + p] = [N - (\theta - p)][(\theta + 1 - p) - N]$, where
$N = N(\theta, p)$ is an integer satisfying $\theta - p < N \le \theta + 1 - p$.} ////
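
The behaviour of the estimator $g(X_1 - p) + p$ can be illustrated with a small simulation. This is only a sketch: the helper names, the seed, and the particular values of $\theta$ and $p$ are assumptions chosen so that the floating-point arithmetic is exact.

```python
import math
import random

def g(y):
    # greatest integer strictly less than y
    return math.ceil(y) - 1

def estimates(theta, p, size, seed=0):
    rng = random.Random(seed)
    # theta + 1 - U lies in (theta, theta + 1] because U lies in [0, 1)
    xs = [theta + 1.0 - rng.random() for _ in range(size)]
    return [g(x - p) + p for x in xs]

def variance_formula(theta, p):
    # N is the integer with theta - p < N <= theta + 1 - p
    N = math.floor(theta + 1 - p)
    return (N - (theta - p)) * ((theta + 1 - p) - N)

# theta + 1 - p = 2 is an integer here, so every estimate equals theta.
print(set(estimates(1.25, 0.25, 1000)))
# theta + 1 - p = 1.75 is not an integer here, so the estimator has positive variance.
vals = estimates(1.0, 0.25, 20000)
print(sum(vals) / len(vals), variance_formula(1.0, 0.25))
```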

Remark It is sometimes possible to find an UMVUE even when a 
minimal sufficient statistic is not complete. See Prob. II, p. 313, in 
Rao [17]. //// 


6 LOCATION OR SCALE INVARIANCE 

In the last section we employed the property of unbiasedness as a means of 
restricting the class of estimators with the hope of finding an estimator having 
minimum mean-squared error within the restricted class. In this section we will 
indicate how an alternative property, the property of invariance, can be used to
restrict the class of estimators. Our discussion will be limited to only two types
of invariance, namely, location invariance and scale invariance; a fuller discussion,
which is beyond the scope of this book, can be found in Refs. [12] and [19].

6.1 Location Invariance

If the observations $X_1, \ldots, X_n$ represented measurements of some sort and the
parameter being estimated was also measured in the same units, one might
reasonably require that an estimator $t(\cdot, \ldots, \cdot)$ satisfy the property
$t(x_1 + c, x_2 + c, \ldots, x_n + c) = t(x_1, \ldots, x_n) + c$ for every constant $c$. The
idea is that if a constant $c$ is added to each of the measurements $x_1, \ldots, x_n$,
then the estimator evaluated at the adjusted measurements $x_1 + c, \ldots, x_n + c$
ought to adjust the estimated value $t(x_1, \ldots, x_n)$ by adding the same constant
to it. For example, suppose that it is desired to estimate the average weight
of a group of pigs when the only method available for weighing is for a person
to stand on a scale holding a pig, so both the pig and person are weighed. If
one person were to hold the pigs, the measurements (weights) $x_1 + c, \ldots, x_n + c$
would be obtained, where $x_i$ is the weight of the $i$th pig and $c$ is the person's
weight. If, on the other hand, someone else were to hold the pigs, the
measurements $x_1 + c', \ldots, x_n + c'$ would be obtained, where $c'$ is the other person's
weight. It seems reasonable that the estimate of the average weight of the group
of pigs obtained should not depend on which person held the pigs; that is, the
estimate should not vary with $c$, the weight of the pig holder. We define a
location-invariant estimator accordingly.

Definition 23 Location-invariant An estimator $T = t(X_1, \ldots, X_n)$ is
defined to be location-invariant if and only if
$t(x_1 + c, \ldots, x_n + c) = t(x_1, \ldots, x_n) + c$ for all values $x_1, \ldots, x_n$ and all $c$. ////

A number of the estimators that we have encountered are location-invariant;
for example, $\bar{X}_n$ and $(Y_1 + Y_n)/2$, as the following shows:

\[
t(x_1 + c, \ldots, x_n + c) = \frac{\sum (x_i + c)}{n} = \frac{\sum x_i}{n} + c = t(x_1, \ldots, x_n) + c
\]

for $t(x_1, \ldots, x_n) = \bar{x}_n$, and

\[
\begin{aligned}
t(x_1 + c, \ldots, x_n + c)
&= \frac{\min[x_1 + c, \ldots, x_n + c] + \max[x_1 + c, \ldots, x_n + c]}{2} \\
&= \frac{\min[x_1, \ldots, x_n] + c + \max[x_1, \ldots, x_n] + c}{2} \\
&= \frac{\min[x_1, \ldots, x_n] + \max[x_1, \ldots, x_n]}{2} + c \\
&= t(x_1, \ldots, x_n) + c
\end{aligned}
\]

for $t(x_1, \ldots, x_n) = (y_1 + y_n)/2$. On the other hand, quite a number of
estimators are not location-invariant; for example, $S^2$ and $Y_n - Y_1$, as the
following shows: Take $T = t(X_1, \ldots, X_n) = S^2 = \sum (X_i - \bar{X}_n)^2/(n-1)$; then
$t(x_1 + c, \ldots, x_n + c) = \sum [x_i + c - \sum (x_j + c)/n]^2/(n-1) = t(x_1, \ldots, x_n)$,
instead of $t(x_1, \ldots, x_n) + c$. Now take $T = t(X_1, \ldots, X_n) = Y_n - Y_1$; then
$t(x_1 + c, \ldots, x_n + c) = \max[x_1 + c, \ldots, x_n + c] - \min[x_1 + c, \ldots, x_n + c] =
\max[x_1, \ldots, x_n] + c - \min[x_1, \ldots, x_n] - c = t(x_1, \ldots, x_n)$, instead of
$t(x_1, \ldots, x_n) + c$.
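
The four calculations above can be mirrored on a small data set. In this sketch the data, the shift `c`, and the helper names are illustrative; a location-invariant estimator should change by exactly `c`, whereas $S^2$ and the range should not change at all.

```python
import statistics

def midrange(xs):
    return (min(xs) + max(xs)) / 2

def sample_range(xs):
    return max(xs) - min(xs)

def shift_effect(estimator, xs, c):
    """Return t(x_1 + c, ..., x_n + c) - t(x_1, ..., x_n)."""
    return estimator([x + c for x in xs]) - estimator(xs)

xs = [2.0, 3.5, 5.0, 9.5]
c = 4.0
for name, est in [("mean", statistics.mean), ("midrange", midrange),
                  ("S^2", statistics.variance), ("range", sample_range)]:
    print(name, shift_effect(est, xs, c))
```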

Our use of location invariance will be similar to our use of unbiasedness. 
We will restrict ourselves to looking at location-invariant estimators and seek 
an estimator within the class of location-invariant estimators that has uniformly 
smallest mean-squared error. The property of location invariance is intuitively 
appealing and turns out also to be practically appealing if the parameter we are 
estimating represents location. 

Definition 24 Location parameter Let $\{f(\cdot\,; \theta), \theta \in \bar{\Theta}\}$ be a family
of densities indexed by a parameter $\theta$, where $\bar{\Theta}$ is the real line. The
parameter $\theta$ is defined to be a location parameter if and only if the density
$f(x; \theta)$ can be written as a function of $x - \theta$; that is, $f(x; \theta) = h(x - \theta)$
for some function $h(\cdot)$. Equivalently, $\theta$ is a location parameter for the
density $f_X(x; \theta)$ of a random variable $X$ if and only if the distribution of
$X - \theta$ does not depend on $\theta$. ////

We note that if $\theta$ is a location parameter for the family of densities
$\{f(\cdot\,; \theta), \theta \in \bar{\Theta}\}$, then the function $h(\cdot)$ of the definition is a density function
given by $h(\cdot) = f(\cdot\,; 0)$.


EXAMPLE 37 We will give examples of several different location parameters.
If $f(x; \theta) = \phi_{\theta,1}(x)$, then $\theta$ is a location parameter since

\[
\phi_{\theta,1}(x) = \frac{1}{\sqrt{2\pi}} \exp\left[-\tfrac12 (x - \theta)^2\right] = \phi_{0,1}(x - \theta).
\]

Or if $X$ is distributed normally with mean $\theta$ and variance 1, then $X - \theta$
has a standard normal distribution; hence the distribution of $X - \theta$ is
independent of $\theta$.

If $f(x; \theta) = I_{(\theta - 1/2,\, \theta + 1/2)}(x)$, then $\theta$ is a location parameter since
$f(x; \theta) = I_{(\theta - 1/2,\, \theta + 1/2)}(x) = I_{(-1/2,\, 1/2)}(x - \theta)$, a function of $x - \theta$.

If $f(x; \alpha) = 1/\pi[1 + (x - \alpha)^2]$, then $\alpha$ is a location parameter since
$f(x; \alpha)$ is a function of $x - \alpha$. ////

We will now state, without proof, a theorem that gives within the class
of location-invariant estimators the uniformly smallest mean-squared error
estimator of a location parameter. The theorem is from Pitman [41].

Theorem 11 Let $X_1, \ldots, X_n$ denote a random sample from the density
$f(\cdot\,; \theta)$, where $\theta$ is a location parameter and $\bar{\Theta}$ is the real line. The
estimator

\[
t(X_1, \ldots, X_n) = \frac{\int \theta \prod_{i=1}^{n} f(X_i; \theta)\, d\theta}{\int \prod_{i=1}^{n} f(X_i; \theta)\, d\theta}
\tag{17}
\]

is the estimator of $\theta$ which has uniformly smallest mean-squared error
within the class of location-invariant estimators. ////


Definition 25 Pitman estimator for location The estimator given in
Eq. (17) is defined to be the Pitman estimator for location. ////

According to the formula given in Eq. (17), determining the Pitman
estimator requires evaluating the integrals given in the numerator and
denominator; such evaluation may not be easy. Note that the integration is with
respect to the parameter, so the resulting ratio will be a function of $X_1, \ldots, X_n$.


EXAMPLE 38 Let $X_1, \ldots, X_n$ be a random sample from a normal distribution
with mean $\theta$ and variance unity. We saw in Example 37 that $\theta$ is a location
parameter. Our object is to find the Pitman estimator of $\theta$, which is
given by Eq. (17). In the following series of equalities one should be
forewarned that cancellations and insertions are being made simultaneously
in the numerator and denominator.

\[
\begin{aligned}
\frac{\int \theta \prod f(X_i; \theta)\, d\theta}{\int \prod f(X_i; \theta)\, d\theta}
&= \frac{\int \theta (1/\sqrt{2\pi})^{n} \exp\left[-\tfrac12 \sum (X_i - \theta)^2\right] d\theta}
        {\int (1/\sqrt{2\pi})^{n} \exp\left[-\tfrac12 \sum (X_i - \theta)^2\right] d\theta} \\
&= \frac{\int \theta \exp\left[-(n/2)\theta^2 + \theta \sum X_i\right] d\theta}
        {\int \exp\left[-(n/2)\theta^2 + \theta \sum X_i\right] d\theta} \\
&= \frac{\int \theta \exp\left[-(n/2)(\theta - \bar{X}_n)^2\right] d\theta}
        {\int \exp\left[-(n/2)(\theta - \bar{X}_n)^2\right] d\theta} \\
&= \frac{\int \theta\, [1/\sqrt{2\pi}(1/\sqrt{n})] \exp\left\{-\tfrac12 [(\theta - \bar{X}_n)/(1/\sqrt{n})]^2\right\} d\theta}
        {\int [1/\sqrt{2\pi}(1/\sqrt{n})] \exp\left\{-\tfrac12 [(\theta - \bar{X}_n)/(1/\sqrt{n})]^2\right\} d\theta} \\
&= \bar{X}_n
\end{aligned}
\]

by noting that the last denominator is just the integral of a normal density
with mean $\bar{X}_n$ and variance $1/n$ and hence is unity, and the last numerator
is the mean of this same normal density and hence is $\bar{X}_n$.

We note that, for this example, the Pitman estimator of $\theta$, which is
uniformly minimum mean-squared error among location-invariant estimators,
is identical to the UMVUE of $\theta$; that is, the estimator that
is best among location-invariant estimators is also best among unbiased
estimators. ////
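
When the integrals in Eq. (17) are awkward, the ratio can be approximated numerically. The sketch below is one possible approach and not a method from the text: it applies a midpoint rule on a finite grid of $\theta$ values centred at the sample mean. The grid width, step count, data, and function names are all assumptions.

```python
import math

def normal_pdf(x, theta):
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2 * math.pi)

def cauchy_pdf(x, alpha):
    return 1.0 / (math.pi * (1.0 + (x - alpha) ** 2))

def pitman_location(xs, pdf, half_width=12.0, steps=20_000):
    """Midpoint-rule approximation of Eq. (17) on a finite theta grid."""
    center = sum(xs) / len(xs)  # used only to place the grid
    lo = center - half_width
    h = 2 * half_width / steps
    num = den = 0.0
    for j in range(steps):
        theta = lo + (j + 0.5) * h
        like = math.prod(pdf(x, theta) for x in xs)
        num += theta * like
        den += like
    return num / den

xs = [0.3, -1.2, 2.5, 0.9, 1.1]
print(pitman_location(xs, normal_pdf), sum(xs) / len(xs))
```

For the normal density the approximation should reproduce the sample mean, in line with Example 38. For the Cauchy density of Example 37 no closed form is given in the text, but the same routine can be used to check that shifting every observation by $c$ shifts the estimate by $c$.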


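EXAMPLE 39 Let $X_1, \ldots, X_n$ be a random sample from a uniform distribution
over the interval $(\theta - \tfrac12, \theta + \tfrac12)$. According to Example 37, $\theta$ is a
location parameter. The Pitman estimator of $\theta$ is

\[
\begin{aligned}
\frac{\int \theta \prod I_{(\theta - 1/2,\, \theta + 1/2)}(X_i)\, d\theta}{\int \prod I_{(\theta - 1/2,\, \theta + 1/2)}(X_i)\, d\theta}
&= \frac{\int \theta \prod I_{(X_i - 1/2,\, X_i + 1/2)}(\theta)\, d\theta}{\int \prod I_{(X_i - 1/2,\, X_i + 1/2)}(\theta)\, d\theta}
 = \frac{\int_{Y_n - 1/2}^{Y_1 + 1/2} \theta\, d\theta}{\int_{Y_n - 1/2}^{Y_1 + 1/2} d\theta} \\
&= \frac{1}{2}\, \frac{(Y_1 + \tfrac12)^2 - (Y_n - \tfrac12)^2}{(Y_1 + \tfrac12) - (Y_n - \tfrac12)}
 = \frac{Y_1 + Y_n}{2}.
\end{aligned}
\]

Recall that for this example there is no UMVUE of $\theta$. (See Example 36.) ////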

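Remark A Pitman estimator for location is a function of sufficient
statistics.

PROOF If $S_1 = s_1(X_1, \ldots, X_n), \ldots, S_k = s_k(X_1, \ldots, X_n)$ is a set of
sufficient statistics, then by the factorization criterion
$\prod_{i=1}^{n} f(x_i; \theta) = g(s_1, \ldots, s_k; \theta)\, h(x_1, \ldots, x_n)$; so the Pitman estimator can be written as

\[
\frac{\int \theta \prod f(X_i; \theta)\, d\theta}{\int \prod f(X_i; \theta)\, d\theta}
= \frac{\int \theta\, g(S_1, \ldots, S_k; \theta)\, h(X_1, \ldots, X_n)\, d\theta}{\int g(S_1, \ldots, S_k; \theta)\, h(X_1, \ldots, X_n)\, d\theta}
= \frac{\int \theta\, g(S_1, \ldots, S_k; \theta)\, d\theta}{\int g(S_1, \ldots, S_k; \theta)\, d\theta},
\]

which is a function of $S_1, \ldots, S_k$. ////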

6.2 Scale Invariance

For those experiments in which measurements can be made in different units,
such as length being measured in either inches or centimeters, weight being
measured in either pounds or kilograms, or volume being measured in either
quarts or liters, one might reasonably require that his statistical procedure be
independent of the measurement units employed. If the statistical procedure
is that of point estimation, then one might require that the estimator that is to
be used satisfy the property of scale invariance defined below. The idea is that
an estimator will be scale-invariant if the estimator does not depend on the scale
of the measurement.


Definition 26 Scale-invariant An estimator $T = t(X_1, \ldots, X_n)$ is
defined to be scale-invariant if and only if $t(cx_1, \ldots, cx_n) = c\,t(x_1, \ldots, x_n)$
for all values $x_1, \ldots, x_n$ and all $c > 0$. ////


A number of the estimators that we have considered are scale-invariant,
including $\bar{X}_n$, $\sqrt{S^2}$, $(Y_1 + Y_n)/2$, and $Y_n - Y_1$. Our discussion of
scale-invariant estimators will be limited to problems concerning estimation of scale
parameters, defined below.
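
Definition 26 can be checked on a small data set by comparing $t(cx_1, \ldots, cx_n)$ with $t(x_1, \ldots, x_n)$. The following sketch uses illustrative data and helper names; a scale-invariant estimator should give a ratio of exactly $c$. The sample variance $S^2$ is included as a contrast, since it scales by $c^2$ rather than $c$.

```python
import statistics

def midrange(xs):
    return (min(xs) + max(xs)) / 2

def sample_range(xs):
    return max(xs) - min(xs)

def scale_ratio(estimator, xs, c):
    """Return t(c*x_1, ..., c*x_n) / t(x_1, ..., x_n)."""
    return estimator([c * x for x in xs]) / estimator(xs)

xs = [2.0, 3.5, 5.0, 9.5]
c = 2.54  # for example, converting inches to centimeters
for name, est in [("mean", statistics.mean), ("sqrt(S^2)", statistics.stdev),
                  ("midrange", midrange), ("range", sample_range),
                  ("S^2", statistics.variance)]:
    print(name, scale_ratio(est, xs, c))
```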

Definition 27 Scale parameter Let $\{f(\cdot\,; \theta), \theta > 0\}$ be a family of
densities indexed by a real parameter $\theta$. The parameter $\theta$ is defined to be a
scale parameter if and only if the density $f(x; \theta)$ can be written as $(1/\theta)h(x/\theta)$
for some density $h(\cdot)$. Equivalently, $\theta$ is a scale parameter for the
density $f_X(x; \theta)$ of a random variable $X$ if and only if the distribution of
$X/\theta$ is independent of $\theta$. ////

Note that if $\theta$ is a scale parameter for the family of densities $\{f(\cdot\,; \theta), \theta > 0\}$,
then the density $h(\cdot)$ of the definition is given by $h(x) = f(x; 1)$.


EXAMPLE 40 We give several examples of scale parameters. If
$f(x; \lambda) = (1/\lambda) e^{-x/\lambda} I_{(0,\infty)}(x)$, then $\lambda$ is a scale parameter since $e^{-y} I_{(0,\infty)}(y)$ is a
density. Note that this parameterization of the negative exponential
distribution is not the parameterization that we have used previously.
If

\[
f(x; \sigma) = \phi_{0,\sigma^2}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac12 \left(\frac{x}{\sigma}\right)^2\right],
\]

then $\sigma$ is a scale parameter since $(1/\sqrt{2\pi}) \exp(-\tfrac12 y^2)$ is a density.

If $f(x; \theta) = (1/\theta) I_{(0,\theta)}(x) = (1/\theta) I_{(0,1)}(x/\theta)$, then $\theta$ is a scale
parameter since $I_{(0,1)}(y)$ is a density.

If $f(x; \theta) = (1/\theta) I_{(\theta,2\theta)}(x) = (1/\theta) I_{(1,2)}(x/\theta)$, then $\theta$ is a scale
parameter since $I_{(1,2)}(y)$ is a density. ////


Our sole result for scale invariance, a result that is comparable to the
result of Theorem 11 on location invariance, requires a slightly different framework.
Instead of measuring error with squared-error loss function we measure
it with the loss function $\ell(t; \theta) = (t - \theta)^2/\theta^2 = (t/\theta - 1)^2$. If $|t - \theta|$ represents
error, then $100\,|t - \theta|/\theta$ can be thought of as percent error, and then $(t - \theta)^2/\theta^2$
is proportional to percent error squared. We state the following theorem, also
from Pitman [41], without proof.

Theorem 12 Let $X_1, \ldots, X_n$ be a random sample from the density $f(\cdot\,; \theta)$,
where $\theta > 0$ is a scale parameter. Assume that $f(x; \theta) = 0$ for $x \le 0$;
that is, the random variables $X_i$ assume only positive values. Within
the class of scale-invariant estimators, the estimator

\[
t(X_1, \ldots, X_n) = \frac{\int_0^\infty (1/\theta^2) \prod_{i=1}^{n} f(X_i; \theta)\, d\theta}{\int_0^\infty (1/\theta^3) \prod_{i=1}^{n} f(X_i; \theta)\, d\theta}
\tag{18}
\]

has uniformly smallest risk for the loss function $\ell(t; \theta) = (t - \theta)^2/\theta^2$. ////

Definition 28 Pitman estimator for scale The estimator given in Eq.
(18) is defined to be the Pitman estimator for scale. ////

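Remark The Pitman estimator for scale is a function of sufficient
statistics. ////


EXAMPLE 41 Let $X_1, \ldots, X_n$ be a random sample from a density
$f(x; \theta) = (1/\theta) I_{(0,\theta)}(x)$. The Pitman estimator for the scale parameter $\theta$ is

\[
\begin{aligned}
\frac{\int_0^\infty (1/\theta^2) \prod (1/\theta) I_{(0,\theta)}(X_i)\, d\theta}{\int_0^\infty (1/\theta^3) \prod (1/\theta) I_{(0,\theta)}(X_i)\, d\theta}
&= \frac{\int_{Y_n}^\infty \theta^{-n-2}\, d\theta}{\int_{Y_n}^\infty \theta^{-n-3}\, d\theta} \\
&= \frac{\{1/[(n+2)-1]\}\, Y_n^{-(n+2)+1}}{\{1/[(n+3)-1]\}\, Y_n^{-(n+3)+1}}
 = \frac{n+2}{n+1}\, Y_n.
\end{aligned}
\]

We know that $Y_n$ is a complete sufficient statistic and $\mathcal{E}[Y_n] = [n/(n+1)]\theta$,
so by the Lehmann-Scheffé theorem $[(n+1)/n]\, Y_n$ is the UMVUE of $\theta$. ////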


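EXAMPLE 42 Let $X_1, \ldots, X_n$ be a random sample from the density
$f(x; \lambda) = (1/\lambda) \exp(-x/\lambda) I_{(0,\infty)}(x)$. The Pitman estimator for the scale
parameter $\lambda$ is

\[
\begin{aligned}
\frac{\int_0^\infty (1/\lambda^2) \prod f(X_i; \lambda)\, d\lambda}{\int_0^\infty (1/\lambda^3) \prod f(X_i; \lambda)\, d\lambda}
&= \frac{\int_0^\infty (1/\lambda^{n+2}) \exp(-\sum X_i/\lambda)\, d\lambda}{\int_0^\infty (1/\lambda^{n+3}) \exp(-\sum X_i/\lambda)\, d\lambda} \\
&= \sum X_i\, \frac{\int_0^\infty \alpha^{n} e^{-\alpha}\, d\alpha}{\int_0^\infty \alpha^{n+1} e^{-\alpha}\, d\alpha}
 = \sum X_i\, \frac{\Gamma(n+1)}{\Gamma(n+2)} = \frac{\sum X_i}{n+1},
\end{aligned}
\]

where the substitution $\alpha = \sum X_i/\lambda$ was made. (It can be shown that the UMVUE of $\lambda$ is $\sum X_i/n$.)

Note that $\sum X_i/n$ is a scale-invariant estimator, and, hence, since
$\sum X_i/(n+1)$ is the scale-invariant estimator having uniformly smallest
risk for the loss function $(t - \theta)^2/\theta^2$, the risk of $\sum X_i/(n+1)$ is uniformly
smaller than the risk of $\sum X_i/n$. Also, since here risk equals $1/\theta^2$ times the
MSE, the MSE of $\sum X_i/(n+1)$ is uniformly smaller than the MSE of
$\sum X_i/n$. ////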
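
Equation (18) can be approximated numerically in the same spirit as Eq. (17). The sketch below is an illustration under stated assumptions rather than a method from the text: it uses a midpoint rule over $0 < \theta <$ `upper`, with `upper`, the step count, the data, and the function names all chosen for convenience.

```python
import math

def exp_scale_pdf(x, lam):
    # (1/lam) * exp(-x/lam) for x > 0, the parameterization of Example 42
    return math.exp(-x / lam) / lam if x > 0 else 0.0

def pitman_scale(xs, pdf, upper, steps=100_000):
    """Midpoint-rule approximation of Eq. (18) over 0 < theta < upper."""
    h = upper / steps
    num = den = 0.0
    for j in range(steps):
        theta = (j + 0.5) * h
        like = math.prod(pdf(x, theta) for x in xs)
        num += like / theta ** 2
        den += like / theta ** 3
    return num / den

xs = [0.8, 2.1, 0.4, 3.3, 1.4]
print(pitman_scale(xs, exp_scale_pdf, upper=200.0), sum(xs) / (len(xs) + 1))
```

The two printed numbers should agree closely, matching the closed form $\sum X_i/(n+1)$. As a side calculation, since $\sum X_i$ has mean $n\lambda$ and variance $n\lambda^2$ under this model, the MSE of $\sum X_i/n$ works out to $\lambda^2/n$ and that of $\sum X_i/(n+1)$ to $\lambda^2/(n+1)$, which is consistent with the comparison made at the end of the example.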


7 BAYES ESTIMATORS

In our considerations of point-estimation problems in the previous sections of
this chapter, we have assumed that our random sample came from some density
$f(\cdot\,; \theta)$, where the function $f(\cdot\,; \cdot)$ was assumed known. Moreover, we have
assumed that $\theta$ was some fixed, though unknown to us, point. In some
real-world situations which the density $f(\cdot\,; \theta)$ represents, there is often additional
information about $\theta$ (the only assumption which we heretofore have made about
$\theta$ is that it can take on values in $\bar{\Theta}$). For example, the experimenter may have
evidence that $\theta$ itself acts as a random variable for which he may be able to
postulate a realistic density function. For instance, suppose that a machine
which stamps out parts for automobiles is to be examined to see what fraction
$\theta$ of defectives is being made. On a certain day, 10 pieces of the machine's
output are examined, with the observations denoted by $X_1, X_2, \ldots, X_{10}$, where
$X_i = 1$ if the $i$th piece is defective and $X_i = 0$ if it is nondefective. These can
be viewed as a random sample of size 10 from the Bernoulli density

\[
f(x; \theta) = \theta^{x} (1 - \theta)^{1-x} I_{\{0,1\}}(x) \qquad \text{for } 0 \le \theta \le 1,
\]

which indicates that the probability that a given part is defective is equal to the
unknown number $\theta$. The joint density of the 10 random variables $X_1, X_2, \ldots, X_{10}$ is

\[
\theta^{\sum x_i} (1 - \theta)^{10 - \sum x_i} \prod_{i=1}^{10} I_{\{0,1\}}(x_i) \qquad \text{for } 0 \le \theta \le 1.
\]

The maximum-likelihood estimator of $\theta$, as explained in previous sections, is
$\hat{\Theta} = \bar{X}$. The method of moments gives the same estimator. Suppose,
however, that the experimenter has some additional information about $\theta$; suppose
that he has observed that on various days the value of $\theta$ changes and it appears
that the change can be represented as a random variable with the density

\[
g_\Theta(\theta) = 6\theta(1 - \theta) I_{[0,1]}(\theta).
\]

An important question is: How can this additional information about $\theta$ be
used to estimate $\theta_0$, where $\theta_0$ is the value that $\Theta$ was equal to on the day the
sample was drawn?

To examine this problem, we will assume, in addition to the assumption
that our random sample came from a density $f(\cdot\,; \theta)$, that the unknown
parameter $\theta$ is the value of some random variable, say $\Theta$. We will still be interested
in estimating some function of $\theta$, say $\tau(\theta)$. If $\Theta$ is a random variable, it has
a distribution. We let $G(\cdot) = G_\Theta(\cdot)$ denote the cumulative distribution function
of $\Theta$ and $g(\cdot) = g_\Theta(\cdot)$ denote the density function of $\Theta$, and we assume these
functions contain no unknown parameters. In order to emphasize that the
distribution of $\Theta$ is over the parameter space, we have departed from our custom
of using $F(\cdot)$ and $f(\cdot)$ to represent a cumulative distribution function and
density function, respectively, and have used $G(\cdot)$ and $g(\cdot)$ instead.

If we assume that the distribution of $\Theta$ is known, we have additional
information. So an important question is: How can this additional information
be used in estimation? It is this question that we will address ourselves to in
the following two subsections. In many problems it may be unrealistic to
assume that $\theta$ is the value of a random variable; in other problems, even though
it seems reasonable to assume that $\theta$ is the value of a random variable $\Theta$, the
distribution of $\Theta$ may not be known, or even if it is known, it may contain other
unknown parameters. However, in some problems the assumption that the
distribution of $\Theta$ is known is realistic, and we shall examine this situation.

7.1 Posterior Distribution

Heretofore we have used the notation $f(x; \theta)$ to indicate the density of a random
variable $X$ for each $\theta$ in $\bar{\Theta}$. Whenever we want to indicate that the parameter
$\theta$ is the value of a random variable $\Theta$, we shall write the density of $X$ as $f(x \mid \theta)$
instead of $f(x; \theta)$. We should note that $f(x \mid \theta)$ is a conditional density; it is
the density of $X$ given $\Theta = \theta$. A more complete notation for $f(x \mid \theta)$ would be
$f_{X \mid \Theta = \theta}(x \mid \theta)$.

Let $X_1, \ldots, X_n$ be a random sample of size $n$ from the density $f(\cdot \mid \theta)$,
where $\theta$ is the value of a random variable $\Theta$. Assume that the density of $\Theta$,
$g_\Theta(\cdot)$, is known and contains no unknown parameters, and suppose that we
want to estimate $\tau(\theta)$. How do we incorporate the additional information of
known $g_\Theta(\cdot)$ into our estimation procedures? In the past, we thought of the
likelihood function as a single expression that contained all our information; the
likelihood function included the observed sample $x_1, \ldots, x_n$ as well as the form
of the density $f(\cdot\,; \theta)$ we sampled from in its expression. Now we need an
expression that contains all the information that the likelihood function
contains plus the added information of the known density $g_\Theta(\cdot)$. $g_\Theta(\cdot)$ is called the
prior distribution of $\Theta$. It summarizes what we know about $\theta$ prior to taking
a random sample. What we seek is an expression that summarizes what we
know about $\theta$ after we take a random sample. We seek the posterior
distribution of $\Theta$ given $X_1 = x_1, \ldots, X_n = x_n$.

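Definition 29 Prior and posterior distributions The density $g_\Theta(\cdot)$ is
called the prior distribution of $\Theta$. The conditional density of $\Theta$ given
$X_1 = x_1, \ldots, X_n = x_n$, denoted by $f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)$, is
called the posterior distribution of $\Theta$. ////

Remark

\[
f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)
= \frac{f_{X_1, \ldots, X_n \mid \Theta = \theta}(x_1, \ldots, x_n \mid \theta)\, g_\Theta(\theta)}{f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)}
= \frac{\left[\prod_{i=1}^{n} f(x_i \mid \theta)\right] g_\Theta(\theta)}{\int \left[\prod_{i=1}^{n} f(x_i \mid \theta)\right] g_\Theta(\theta)\, d\theta}
\tag{19}
\]

for random sampling. [Recall that $f_{Y \mid X = x}(y \mid x) = f_{X,Y}(x, y)/f_X(x) =
f_{X \mid Y = y}(x \mid y) f_Y(y)/f_X(x)$.] ////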


The posterior distribution replaces the likelihood function as an expression
that incorporates all information. If we want to estimate $\theta$ and parallel the
development of the maximum-likelihood estimator of $\theta$, we could take as our
estimator of $\theta$ that $\theta$ which maximizes the posterior distribution, that is, estimate
$\theta$ with the mode of the posterior distribution. However, unlike the likelihood
function (as a function of $\theta$), the posterior distribution is a distribution function;
so we could just as well estimate $\theta$ with the median or mean of the posterior
distribution. We will use the mean of the posterior distribution as our estimate
of $\theta$, and in general we could estimate $\tau(\theta)$ as the mean of $\tau(\Theta)$ given
$X_1 = x_1, \ldots, X_n = x_n$; that is, take $\mathcal{E}[\tau(\Theta) \mid X_1 = x_1, \ldots, X_n = x_n]$ as our
estimate of $\tau(\theta)$.

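Definition 30 Posterior Bayes estimator Let $X_1, \ldots, X_n$ be a random
sample from a density $f(x \mid \theta)$, where $\theta$ is a value of the random variable
$\Theta$ with known density $g_\Theta(\cdot)$. The posterior Bayes estimator of $\tau(\theta)$ with
respect to the prior $g_\Theta(\cdot)$ is defined to be

\[
\mathcal{E}[\tau(\Theta) \mid X_1, \ldots, X_n]. \tag{20}
\]

////


Remark

\[
\begin{aligned}
\mathcal{E}[\tau(\Theta) \mid X_1 = x_1, \ldots, X_n = x_n]
&= \int \tau(\theta)\, f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)\, d\theta \\
&= \frac{\int \tau(\theta) \left[\prod_{i=1}^{n} f(x_i \mid \theta)\right] g_\Theta(\theta)\, d\theta}{\int \left[\prod_{i=1}^{n} f(x_i \mid \theta)\right] g_\Theta(\theta)\, d\theta}.
\end{aligned}
\tag{21}
\]

////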


One might note the similarity between the posterior Bayes estimator of
$\tau(\theta) = \theta$ and the Pitman estimator of a location parameter [see Eq. (17)].


EXAMPLE 43 Let $X_1, \ldots, X_n$ denote a random sample from the Bernoulli
density $f(x \mid \theta) = \theta^{x}(1 - \theta)^{1-x}$ for $x = 0, 1$. Assume that the prior
distribution of $\Theta$ is given by $g_\Theta(\theta) = I_{(0,1)}(\theta)$; that is, $\Theta$ is uniformly
distributed over the interval $(0, 1)$. Consider estimating $\theta$ and $\tau(\theta) = \theta(1 - \theta)$.
Now

\[
f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)
= \frac{\theta^{\sum x_i} (1 - \theta)^{n - \sum x_i} I_{(0,1)}(\theta)}{\int_0^1 \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}\, d\theta};
\]

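so the posterior Bayes estimator of $\theta$ with respect to the uniform prior
distribution is given by

\[
\begin{aligned}
\mathcal{E}[\Theta \mid X_1 = x_1, \ldots, X_n = x_n]
&= \int_0^1 \theta\, f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)\, d\theta \\
&= \frac{\int_0^1 \theta \cdot \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}\, d\theta}{\int_0^1 \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}\, d\theta}
 = \frac{B(\sum x_i + 2,\ n - \sum x_i + 1)}{B(\sum x_i + 1,\ n - \sum x_i + 1)} \\
&= \frac{\Gamma(\sum x_i + 2)\, \Gamma(n - \sum x_i + 1)}{\Gamma(n + 3)} \cdot \frac{\Gamma(n + 2)}{\Gamma(\sum x_i + 1)\, \Gamma(n - \sum x_i + 1)} \\
&= \frac{\sum x_i + 1}{n + 2}.
\end{aligned}
\]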

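Hence the posterior Bayes estimator of $\theta$ with respect to the uniform prior
distribution is given by $(\sum X_i + 1)/(n + 2)$. Contrast this to the
maximum-likelihood estimator of $\theta$, which is $\sum X_i/n$. $\sum X_i/n$ is unbiased
and an UMVUE, whereas the posterior Bayes estimator is not unbiased.
To obtain the posterior Bayes estimator of, say, $\tau(\theta) = \theta(1 - \theta)$, we
calculate

\[
\begin{aligned}
\mathcal{E}[\Theta(1 - \Theta) \mid X_1 = x_1, \ldots, X_n = x_n]
&= \int_0^1 \theta(1 - \theta)\, f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)\, d\theta \\
&= \frac{\int_0^1 \theta(1 - \theta)\, \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}\, d\theta}{\int_0^1 \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}\, d\theta} \\
&= \frac{\Gamma(\sum x_i + 2)\, \Gamma(n - \sum x_i + 2)}{\Gamma(n + 4)} \cdot \frac{\Gamma(n + 2)}{\Gamma(\sum x_i + 1)\, \Gamma(n - \sum x_i + 1)} \\
&= \frac{(\sum x_i + 1)(n - \sum x_i + 1)}{(n + 3)(n + 2)}.
\end{aligned}
\]

So the posterior Bayes estimator of $\theta(1 - \theta)$ with respect to a uniform
prior distribution is $(\sum X_i + 1)(n - \sum X_i + 1)/(n + 3)(n + 2)$. ////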
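
Both posterior means in Example 43 are ratios of beta integrals with integer exponents, so they can be reproduced in exact rational arithmetic. The following sketch assumes the uniform prior of the example; the function names and the sample values `s = 3`, `n = 10` are illustrative.

```python
from fractions import Fraction
from math import factorial

def beta_integral(a, b):
    """Integral of theta^a (1 - theta)^b over (0, 1) for nonnegative integers a, b."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def posterior_mean(powers, s, n):
    """Posterior mean of theta^j (1 - theta)^k under a uniform prior.

    powers = (j, k); s is the observed sum of the x_i; n is the sample size.
    """
    j, k = powers
    return beta_integral(s + j, n - s + k) / beta_integral(s, n - s)

s, n = 3, 10
print(posterior_mean((1, 0), s, n), Fraction(s + 1, n + 2))
print(posterior_mean((1, 1), s, n), Fraction((s + 1) * (n - s + 1), (n + 3) * (n + 2)))
```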

We noted in the above example that the posterior Bayes estimator that
we obtained was not unbiased. The following remark states that in general a
posterior Bayes estimator is not unbiased.

Remark Let $T_G^* = t_G^*(X_1, \ldots, X_n)$ denote the posterior Bayes estimator
of $\tau(\theta)$ with respect to a prior distribution $G(\cdot)$. If both $T_G^*$ and $\tau(\Theta)$
have finite variance, then either $\operatorname{var}[T_G^* \mid \theta] = 0$, or $T_G^*$ is not an unbiased
estimator of $\tau(\theta)$. That is, either $T_G^*$ estimates $\tau(\theta)$ correctly with
probability 1, or $T_G^*$ is not an unbiased estimator of $\tau(\theta)$.

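PROOF Let us suppose that $T_G^*$ is an unbiased estimator of $\tau(\theta)$;
that is, $\mathcal{E}[T_G^* \mid \theta] = \tau(\theta)$. By definition we have
$T_G^* = t_G^*(X_1, \ldots, X_n) = \mathcal{E}[\tau(\Theta) \mid X_1, \ldots, X_n]$. Now

\[
\begin{aligned}
\operatorname{var}[T_G^*] &= \mathcal{E}[\operatorname{var}[T_G^* \mid \Theta]] + \operatorname{var}[\mathcal{E}[T_G^* \mid \Theta]] \\
&= \mathcal{E}[\operatorname{var}[T_G^* \mid \Theta]] + \operatorname{var}[\tau(\Theta)],
\end{aligned}
\]

and

\[
\begin{aligned}
\operatorname{var}[\tau(\Theta)] &= \mathcal{E}[\operatorname{var}[\tau(\Theta) \mid X_1, \ldots, X_n]] + \operatorname{var}[\mathcal{E}[\tau(\Theta) \mid X_1, \ldots, X_n]] \\
&= \mathcal{E}[\operatorname{var}[\tau(\Theta) \mid X_1, \ldots, X_n]] + \operatorname{var}[T_G^*];
\end{aligned}
\]

hence, $\mathcal{E}[\operatorname{var}[T_G^* \mid \Theta]] + \mathcal{E}[\operatorname{var}[\tau(\Theta) \mid X_1, \ldots, X_n]] = 0$. Since both
$\mathcal{E}[\operatorname{var}[T_G^* \mid \Theta]]$ and $\mathcal{E}[\operatorname{var}[\tau(\Theta) \mid X_1, \ldots, X_n]]$ are nonnegative and their
sum is 0, both are 0. In particular, $\mathcal{E}[\operatorname{var}[T_G^* \mid \Theta]] = 0$, and since
$\operatorname{var}[T_G^* \mid \Theta]$ is nonnegative and has zero expectation, $\operatorname{var}[T_G^* \mid \theta] = 0$. ////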
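
The remark can be illustrated with the estimator $(\sum X_i + 1)/(n + 2)$ of Example 43. The sketch below computes its conditional mean and variance given $\theta$ exactly, assuming $\sum X_i$ is binomial with parameters $n$ and $\theta$; the function name and the chosen values are illustrative.

```python
from fractions import Fraction
from math import comb

def conditional_mean_and_variance(n, theta):
    """Mean and variance of (S + 1)/(n + 2) given theta, with S ~ binomial(n, theta)."""
    theta = Fraction(theta)
    mean = second = Fraction(0)
    for s in range(n + 1):
        prob = comb(n, s) * theta ** s * (1 - theta) ** (n - s)
        t = Fraction(s + 1, n + 2)
        mean += t * prob
        second += t * t * prob
    return mean, second - mean ** 2

theta = Fraction(1, 4)
mean, var = conditional_mean_and_variance(10, theta)
print(mean, theta, var)
```

The conditional variance is positive, and the conditional mean, which equals $(n\theta + 1)/(n + 2)$, differs from $\theta$ except at $\theta = 1/2$, so the estimator is not unbiased.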

7.2 Loss-function Approach 

In Subsec. 3.4 we introduced the concepts of loss and risk. These two concepts
were used to assess goodness of estimators. In this section we discuss how the
additional information of knowledge of a prior distribution of $\Theta$ can be used
in conjunction with loss and risk to define or select an optimum estimator.

We commence with a review of the problem we hope to solve. Let
$X_1, \ldots, X_n$ be a random sample from a density $f(x \mid \theta)$, $\theta$ belonging to $\bar{\Theta}$, where
the function $f(\cdot \mid \theta)$ is assumed known except for $\theta$. We assume that the
unknown $\theta$ is the value of some random variable $\Theta$ and that the distribution of
$\Theta$ is known and contains no unknown parameters. On the basis of the random
sample $X_1, \ldots, X_n$ we hope to estimate $\tau(\theta)$, some function of $\theta$. In addition,
we assume that a loss function $\ell(t; \theta)$ has been specified, where $\ell(t; \theta)$ represents
the loss incurred if we estimate $\tau(\theta)$ to be $t$ when $\theta$ is the parameter of the
density from which we sampled. For any estimator $T = t(X_1, \ldots, X_n)$, we
noted in Subsec. 3.4 that $\mathcal{E}_\theta[\ell(T; \theta)]$ represented the average loss of that
estimator, and we defined this average loss to be the risk, denoted by $\mathcal{R}_t(\theta)$, of the
estimator $t(\cdot, \ldots, \cdot)$. We further noted that two estimators, say
$T_1 = t_1(X_1, \ldots, X_n)$ and $T_2 = t_2(X_1, \ldots, X_n)$, could be compared by looking at their
respective risks $\mathcal{R}_{t_1}(\theta)$ and $\mathcal{R}_{t_2}(\theta)$, preference being given to that estimator with
smaller risk. In general, the risk functions as functions of $\theta$ of two estimators
may cross, one risk function being smaller for some $\theta$ and the other smaller
for other $\theta$. Then, since $\theta$ is unknown, it is difficult to make a choice between
the two estimators. The difficulty is caused by the dependence of the risk
function on $\theta$. Now, since we have assumed that $\theta$ is the value of some random
variable $\Theta$, the distribution of which is also assumed known, we have a natural
way of removing the dependence of the risk function on $\theta$, namely, by averaging
out the $\theta$, using the density of $\Theta$ as our weight function.

Definition 31 Bayes risk Let $X_1, \ldots, X_n$ be a random sample from a
density $f(x \mid \theta)$, where $\theta$ is the value of a random variable $\Theta$ with cumulative
distribution function $G(\cdot) = G_\Theta(\cdot)$ and corresponding density
$g(\cdot) = g_\Theta(\cdot)$. In estimating $\tau(\theta)$, let $\ell(t; \theta)$ be the loss function. The
risk of estimator $T = t(X_1, \ldots, X_n)$ is denoted by $\mathcal{R}_t(\theta)$. The Bayes risk
of estimator $T = t(X_1, \ldots, X_n)$ with respect to the loss function $\ell(\cdot\,; \cdot)$
and prior cumulative distribution $G(\cdot)$, denoted by $r(t) = r_{\ell,G}(t)$, is
defined to be

\[
r(t) = r_{\ell,G}(t) = \int_{\bar{\Theta}} \mathcal{R}_t(\theta)\, g(\theta)\, d\theta. \tag{22}
\]

////

The Bayes risk of an estimator is an average risk, the averaging being over
the parameter space $\bar{\Theta}$ with respect to the prior density $g(\cdot)$. For given
loss function $\ell(\cdot\,; \cdot)$ and prior density $g(\cdot)$, the Bayes risk of an estimator is a
real number; so now two competing estimators can be readily compared by
comparing their respective Bayes risks, still preferring that estimator with smaller
Bayes risk. In fact, we can now define the "best" estimator of $\tau(\theta)$ to be that
estimator with smallest Bayes risk.
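
As an illustration of comparing two estimators by a single number, the sketch below computes exact Bayes risks for Bernoulli sampling. It assumes squared-error loss and the uniform prior of Example 43, neither of which is required by the definition; the function names are illustrative. The two estimators compared are $\sum X_i/n$ and $(\sum X_i + 1)/(n + 2)$.

```python
from fractions import Fraction
from math import comb, factorial

def beta_integral(a, b):
    """Integral of theta^a (1 - theta)^b over (0, 1) for nonnegative integers a, b."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def bayes_risk(estimate, n):
    """Bayes risk under squared-error loss and a uniform prior on (0, 1).

    estimate(s, n) is the value of the estimator when sum(x_i) = s.
    """
    total = Fraction(0)
    for s in range(n + 1):
        t = Fraction(estimate(s, n))
        # integral of (t - theta)^2 theta^s (1 - theta)^(n - s) over (0, 1), expanded
        inner = (t * t * beta_integral(s, n - s)
                 - 2 * t * beta_integral(s + 1, n - s)
                 + beta_integral(s + 2, n - s))
        total += comb(n, s) * inner
    return total

def mle(s, n):
    return Fraction(s, n)

def posterior_mean(s, n):
    return Fraction(s + 1, n + 2)

print(bayes_risk(mle, 10), bayes_risk(posterior_mean, 10))
```

Under these assumptions the two Bayes risks work out to $1/(6n)$ and $1/[6(n+2)]$, so the second estimator is preferred by this criterion.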

Definition 32 Bayes estimator The Bayes estimator of $\tau(\theta)$, denoted by
$T_{\ell,G}^* = t_{\ell,G}^*(X_1, \ldots, X_n)$, with respect to the loss function $\ell(\cdot\,; \cdot)$ and
prior cumulative distribution $G(\cdot)$, is defined to be that estimator with
smallest Bayes risk. Or the Bayes estimator of $\tau(\theta)$ is that estimator
$t_{\ell,G}^*$ satisfying

\[
r_{\ell,G}(t^*) = r_{\ell,G}(t_{\ell,G}^*) \le r_{\ell,G}(t)
\]

for every other estimator $T = t(X_1, \ldots, X_n)$ of $\tau(\theta)$. ////

The posterior Bayes estimator of $\tau(\theta)$, defined in Definition 30, was defined
without regard to a loss function, whereas the definition given above requires
specification of a loss function.

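The definition leaves the problem of actually finding the Bayes estimator,
which may not be easy for an arbitrary loss function, unsolved. However, for
squared-error loss, finding the Bayes estimator is relatively easy. We seek that
estimator, say $t^*(\cdot, \ldots, \cdot)$, which minimizes the expression
$\int_{\bar{\Theta}} \mathcal{R}_t(\theta)\, g(\theta)\, d\theta = \int_{\bar{\Theta}} \mathcal{E}_\theta[[t(X_1, \ldots, X_n) - \tau(\theta)]^2]\, g(\theta)\, d\theta$
as a function over possible estimators $t(\cdot, \ldots, \cdot)$. Now,

\[
\begin{aligned}
&\int_{\bar{\Theta}} \mathcal{E}_\theta[[t(X_1, \ldots, X_n) - \tau(\theta)]^2]\, g(\theta)\, d\theta \\
&\quad= \int_{\bar{\Theta}} \left\{ \int [t(x_1, \ldots, x_n) - \tau(\theta)]^2 f_{X_1, \ldots, X_n}(x_1, \ldots, x_n \mid \theta) \prod_{i=1}^{n} dx_i \right\} g(\theta)\, d\theta \\
&\quad= \int \left\{ \int_{\bar{\Theta}} [\tau(\theta) - t(x_1, \ldots, x_n)]^2\, \frac{f_{X_1, \ldots, X_n}(x_1, \ldots, x_n \mid \theta)\, g(\theta)}{f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)}\, d\theta \right\} f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \prod_{i=1}^{n} dx_i \\
&\quad= \int \left\{ \int_{\bar{\Theta}} [\tau(\theta) - t(x_1, \ldots, x_n)]^2 f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)\, d\theta \right\} f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \prod_{i=1}^{n} dx_i,
\end{aligned}
\]

and since the integrand is nonnegative, the double integral can be minimized
if the expression within the braces is minimized for each $x_1, \ldots, x_n$. But the
expression within the braces is the conditional expectation of
$[\tau(\Theta) - t(x_1, \ldots, x_n)]^2$ with respect to the posterior distribution of $\Theta$ given
$X_1 = x_1, \ldots, X_n = x_n$, which is minimized as a function of $t(x_1, \ldots, x_n)$ for
$t^*(x_1, \ldots, x_n)$ equal to the conditional expectation of $\tau(\Theta)$ with respect to the
posterior distribution of $\Theta$ given $X_1 = x_1, \ldots, X_n = x_n$. {Recall that
$\mathcal{E}[(Z - a)^2]$ is minimized as a function of $a$ for $a^* = \mathcal{E}[Z]$.} Hence the Bayes
estimator of $\tau(\theta)$ with respect to the squared-error loss function is given by

\[
\mathcal{E}[\tau(\Theta) \mid X_1 = x_1, \ldots, X_n = x_n]
= \frac{\int \tau(\theta) \left[\prod_{i=1}^{n} f(x_i \mid \theta)\right] g(\theta)\, d\theta}{\int \left[\prod_{i=1}^{n} f(x_i \mid \theta)\right] g(\theta)\, d\theta},
\tag{23}
\]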

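which is identical to the estimator given in Eq. (21).

For a general loss function, we seek that estimator which minimizes

\[
\begin{aligned}
\int_{\bar{\Theta}} \mathcal{R}_t(\theta)\, g(\theta)\, d\theta
&= \int_{\bar{\Theta}} \left[ \int \ell(t(x_1, \ldots, x_n); \theta)\, f_{X_1, \ldots, X_n}(x_1, \ldots, x_n \mid \theta) \prod_{i=1}^{n} dx_i \right] g(\theta)\, d\theta \\
&= \int \left[ \int_{\bar{\Theta}} \ell(t(x_1, \ldots, x_n); \theta)\, f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)\, d\theta \right] f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \prod_{i=1}^{n} dx_i,
\end{aligned}
\]

and minimizing the double integral is equivalent to minimizing the expression
within the brackets, which is sometimes called the posterior risk. So, in general,
the Bayes estimator of $\tau(\theta)$ with respect to the loss function $\ell(\cdot\,; \cdot)$ and prior
density $g(\cdot)$ is that estimator which minimizes the posterior risk, which is the
expected loss with respect to the posterior distribution of $\Theta$ given the
observations $x_1, \ldots, x_n$. We have the following theorem and corollaries.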

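Theorem 13 Let $X_1, \ldots, X_n$ be a random sample from the density
$f(x \mid \theta)$, and let $g(\theta)$ be the density of $\Theta$. Further, let $\ell(t; \theta)$ be the loss
function for estimating $\tau(\theta)$. The Bayes estimator of $\tau(\theta)$ is that estimator
$t^*(\cdot, \ldots, \cdot)$ which minimizes

\[
\int_{\bar{\Theta}} \ell(t(x_1, \ldots, x_n); \theta)\, f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)\, d\theta
\tag{24}
\]

as a function of $t(\cdot, \ldots, \cdot)$. ////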


Corollary Under the assumptions of Theorem 13, the Bayes estimator 
of t{ 0) is given by 


*[1(0)1*, - x„ 


.*, = *„] = 


Jt(0)[n/(x < ]0)] ff ( 0 ) d0 

/ j)f(x t \0) 9(0) do 


for a squared-error loss function 


(25) 

M 


Corollary Under the assumptions of Theorem 13, the Bayes estimator 
of 0 is given by the median of the posterior distribution of O for a loss 
function equal to the absolute deviation //// 

The proofs of the theorem and first corollary preceded the statement of the
theorem. The second corollary follows from the observation that

$$
\int_{\overline{\Theta}} |\theta - t(x_1,\ldots,x_n)|\,
f_{\Theta\mid X_1=x_1,\ldots,X_n=x_n}(\theta\mid x_1,\ldots,x_n)\, d\theta
$$

is minimized as a function of $t(\cdot, \ldots, \cdot)$ for $t^*(x_1, \ldots, x_n)$ equal to the
median of the posterior distribution of $\Theta$. {Recall that $\mathcal{E}[|Z - a|]$ is
minimized as a function of $a$ for $a^*$ = median of $Z$.}
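
The two corollaries can be illustrated numerically. The following is a minimal sketch, not part of the original development: it assumes a small discrete posterior distribution with made-up support points and probabilities (`values`, `probs`), and it searches a grid of candidate estimates for the one that minimizes the posterior risk. The function names are hypothetical.

```python
def expected_loss(loss, values, probs, a):
    """Posterior risk: expected loss of the guess `a` under a discrete posterior."""
    return sum(p * loss(a, v) for v, p in zip(values, probs))


def bayes_estimate_on_grid(loss, values, probs, grid):
    """Return the grid point that minimizes the posterior risk."""
    return min(grid, key=lambda a: expected_loss(loss, values, probs, a))


def squared_error(a, theta):
    return (a - theta) ** 2


def absolute_error(a, theta):
    return abs(a - theta)


# Hypothetical discrete posterior for theta (illustrative values only).
values = [0.0, 1.0, 2.0, 10.0]
probs = [0.1, 0.5, 0.3, 0.1]
grid = [k / 100 for k in range(0, 1001)]  # candidates 0.00, 0.01, ..., 10.00

posterior_mean = sum(p * v for v, p in zip(values, probs))
best_squared = bayes_estimate_on_grid(squared_error, values, probs, grid)
best_absolute = bayes_estimate_on_grid(absolute_error, values, probs, grid)

print(posterior_mean, best_squared, best_absolute)
```

For this assumed posterior the squared-error minimizer coincides with the posterior mean, and the absolute-error minimizer coincides with the posterior median (here the support point 1), in agreement with the two corollaries.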


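EXAMPLE 44 Let $X_1, \ldots, X_n$ be a random sample from the normal density
with mean $\theta$ and variance 1. Consider estimating $\theta$ with a squared-error
loss function. Assume that $\Theta$ has a normal density with mean $\mu_0$ and
variance 1. Write $\mu_0 = x_0$ when convenient. According to Eq. (25) the
Bayes estimator is given as the mean of the posterior distribution of $\Theta$.

$$
\begin{aligned}
f_{\Theta\mid X_1=x_1,\ldots,X_n=x_n}(\theta\mid x_1,\ldots,x_n)
&= \frac{f_{X_1,\ldots,X_n\mid\Theta=\theta}(x_1,\ldots,x_n\mid\theta)\, g(\theta)}{f_{X_1,\ldots,X_n}(x_1,\ldots,x_n)}
 = \frac{\left[\prod_{i=1}^{n} f(x_i\mid\theta)\right] g(\theta)}
        {\int_{-\infty}^{\infty} \left[\prod_{i=1}^{n} f(x_i\mid\theta)\right] g(\theta)\, d\theta} \\
&= \frac{(1/\sqrt{2\pi})^{n} \exp\left[-\tfrac12 \sum_{i=1}^{n} (x_i-\theta)^2\right] (1/\sqrt{2\pi}) \exp\left[-\tfrac12 (\theta-\mu_0)^2\right]}
        {\int_{-\infty}^{\infty} (1/\sqrt{2\pi})^{n} \exp\left[-\tfrac12 \sum_{i=1}^{n} (x_i-\theta)^2\right] (1/\sqrt{2\pi}) \exp\left[-\tfrac12 (\theta-\mu_0)^2\right] d\theta} \\
&= \frac{\exp\left[-\tfrac12 \sum_{i=0}^{n} (x_i-\theta)^2\right]}
        {\int_{-\infty}^{\infty} \exp\left[-\tfrac12 \sum_{i=0}^{n} (x_i-\theta)^2\right] d\theta} \\
&= \frac{\exp\left\{-\tfrac12 \left[(n+1)\theta^2 - 2\theta \sum_{0}^{n} x_i + \sum_{0}^{n} x_i^2\right]\right\}}
        {\int_{-\infty}^{\infty} \exp\left\{-\tfrac12 \left[(n+1)\theta^2 - 2\theta \sum_{0}^{n} x_i + \sum_{0}^{n} x_i^2\right]\right\} d\theta} \\
&= \frac{\exp\left(-\frac{n+1}{2} \left\{\theta^2 - 2\theta \sum_{0}^{n} x_i/(n+1) + \left[\sum_{0}^{n} x_i/(n+1)\right]^2\right\}\right)}
        {\int_{-\infty}^{\infty} \exp\left(-\frac{n+1}{2} \left\{\theta^2 - 2\theta \sum_{0}^{n} x_i/(n+1) + \left[\sum_{0}^{n} x_i/(n+1)\right]^2\right\}\right) d\theta} \\
&= \frac{\left[1/\sqrt{2\pi/(n+1)}\right] \exp\left\{-\frac{n+1}{2} \left[\theta - \sum_{0}^{n} x_i/(n+1)\right]^2\right\}}
        {\int_{-\infty}^{\infty} \left[1/\sqrt{2\pi/(n+1)}\right] \exp\left\{-\frac{n+1}{2} \left[\theta - \sum_{0}^{n} x_i/(n+1)\right]^2\right\} d\theta} \\
&= \frac{1}{\sqrt{2\pi/(n+1)}} \exp\left\{-\frac{n+1}{2} \left[\theta - \frac{\sum_{0}^{n} x_i}{n+1}\right]^2\right\}.
\end{aligned}
$$

The denominator is unity since it is the integral of a density. We have
shown that the posterior distribution of $\Theta$ is normal with mean
$\sum_{0}^{n} x_i/(n+1)$ and variance $1/(n+1)$; hence the Bayes estimator of $\theta$ with
respect to squared-error loss is

$$
\frac{\sum_{i=0}^{n} X_i}{n+1} = \frac{\mu_0 + \sum_{i=1}^{n} X_i}{n+1}.
$$

Since the posterior distribution of $\Theta$ is normal, its mean and median are
the same; hence

$$
\frac{\mu_0 + \sum_{i=1}^{n} X_i}{n+1}
$$

is also the Bayes estimator with respect to a loss function equal to the
absolute deviation. ////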
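
As a check on the algebra of Example 44, one can evaluate the ratio of integrals in Eq. (25) numerically and compare it with the closed form $(\mu_0 + \sum x_i)/(n+1)$. The sketch below is only illustrative: the data values, the prior mean, the integration range, and the function names are assumptions chosen for the demonstration, and a plain midpoint rule stands in for exact integration.

```python
import math


def posterior_mean_numeric(xs, mu0, lo=-20.0, hi=20.0, steps=40000):
    """Posterior mean of theta via a midpoint-rule version of the ratio in Eq. (25).

    Assumes X_i ~ N(theta, 1) and a N(mu0, 1) prior; constants cancel in the ratio.
    """
    def kernel(theta):
        s = sum((x - theta) ** 2 for x in xs) + (theta - mu0) ** 2
        return math.exp(-0.5 * s)

    h = (hi - lo) / steps
    num = den = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * h
        w = kernel(theta)
        num += theta * w
        den += w
    return num / den


def posterior_mean_closed_form(xs, mu0):
    """Closed form from Example 44: (mu0 + sum of x_i) / (n + 1)."""
    return (mu0 + sum(xs)) / (len(xs) + 1)


# Hypothetical observations and prior mean, for illustration only.
xs = [0.3, -1.2, 2.5, 0.8]
mu0 = 1.0
print(posterior_mean_numeric(xs, mu0), posterior_mean_closed_form(xs, mu0))
```

The two values should agree to many decimal places, since the integrand is a smooth normal kernel concentrated well inside the assumed integration range.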


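EXAMPLE 45 Let $X_1, \ldots, X_n$ be a random sample from the density
$f(x \mid \theta) = (1/\theta) I_{(0,\theta)}(x)$. Estimate $\theta$ with the loss function
$\ell(t; \theta) = (t - \theta)^2/\theta^2$. Assume that $\Theta$ has a density given by
$g(\theta) = I_{(0,1)}(\theta)$. Let $y_n$ denote $\max[x_1, \ldots, x_n]$. Find the posterior
distribution of $\Theta$.

$$
\begin{aligned}
f_{\Theta\mid X_1=x_1,\ldots,X_n=x_n}(\theta\mid x_1,\ldots,x_n)
&= \frac{(1/\theta)^{n} \prod_{i=1}^{n} I_{(0,\theta)}(x_i)\, I_{(0,1)}(\theta)}
        {\int_{0}^{1} (1/\theta)^{n} \prod_{i=1}^{n} I_{(0,\theta)}(x_i)\, d\theta} \\
&= \frac{(1/\theta)^{n} I_{(y_n,1)}(\theta)}{\int_{y_n}^{1} (1/\theta)^{n}\, d\theta}
 = \frac{(1/\theta^{n}) I_{(y_n,1)}(\theta)}{[1/(n-1)](1/y_n^{\,n-1} - 1)}.
\end{aligned}
$$

We seek that estimator which minimizes Eq. (24), or we seek that
estimator $t(\cdot)$ which minimizes

$$
\int_{y_n}^{1} \frac{[t(y_n) - \theta]^2}{\theta^2} \cdot
\frac{1/\theta^{n}}{[1/(n-1)](1/y_n^{\,n-1} - 1)}\, d\theta,
$$

or that estimator which minimizes

$$
\int_{y_n}^{1} \frac{[t(y_n) - \theta]^2}{\theta^{n+2}}\, d\theta
= [t(y_n)]^2 \int_{y_n}^{1} \frac{d\theta}{\theta^{n+2}}
- 2\, t(y_n) \int_{y_n}^{1} \frac{d\theta}{\theta^{n+1}}
+ \int_{y_n}^{1} \frac{d\theta}{\theta^{n}}.
\qquad (26)
$$

Equation (26) is a quadratic equation in $t(\cdot)$; this quadratic equation
assumes its minimum for

$$
t^*(y_n) = \frac{\int_{y_n}^{1} (1/\theta^{n+1})\, d\theta}{\int_{y_n}^{1} (1/\theta^{n+2})\, d\theta}
= \frac{[1/(-n)](1 - 1/y_n^{\,n})}{[1/(-(n+1))](1 - 1/y_n^{\,n+1})}
= \frac{n+1}{n} \cdot \frac{1 - y_n^{\,n}}{1 - y_n^{\,n+1}}\, y_n .
$$

////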

We note that the Bayes estimators derived in Examples 43 to 45 are
functions of sufficient statistics. It can be shown that this is generally true;
that is, a Bayes estimator is always a function of minimal sufficient statistics. In
fact, under quite general conditions it can be shown that the Bayes estimator
corresponding to an arbitrary prior probability density function, which is
positive for all $\theta$ belonging to $\overline{\Theta}$, is consistent and BAN. So, even if you do not
know the correct prior distribution, a Bayes estimator has some desirable
optimum properties. And if you do know the correct prior distribution and
accept the criterion that a best estimator is one that minimizes average loss, then
the Bayes estimator corresponding to the known prior distribution is optimum.

Even in those problems when the prior distribution is unknown, the 
concept of Bayes estimation can benefit us. It provides us with a technique of 
determining many estimators that we might not have otherwise considered. 
Each possible prior distribution has a corresponding estimator, whose merits 
can be judged by using our standard methods of comparison. Thus, we have 
yet another method of finding estimators to append to the methods given in 
Sec. 2. 

Bayes estimation can also sometimes be useful as a tool in obtaining an
estimator possessing some desirable property that does not depend on prior
distribution information. The property of minimax is such a property, and
in the next subsection we will see how Bayes estimation can sometimes be used
to find a minimax estimator. Another such property is given below. Our
objective has been to minimize risk, but since risk depended on the parameter,
we were unable to find one estimator that had smaller risk than all others for
all parameter values. Minimax circumvented such difficulty by replacing the
risk function by its maximum value and then seeking that estimator which
minimized such maximum value. Another way of getting around the difficulty
arising from attempting to uniformly minimize risk is to replace the risk function
by the area under the risk function and to seek that estimator which has the
least area under its risk function. We note that if the parameter space $\overline{\Theta}$ is
an interval, the estimator having the least area under its risk function is the
Bayes estimator corresponding to a uniform prior distribution over the interval $\overline{\Theta}$.
This is true because for a uniform prior distribution the Bayes risk is
proportional to the area under the risk function, and hence minimizing the Bayes
risk is equivalent to minimizing area.

7.3 Minimax Estimator 

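We defined a minimax estimator at the end of Subsec. 3.4 as an estimator whose
maximum risk is less than or equal to the maximum risk of any other estimator.
Such an estimator might be considered "conservative" since it protects against
the worst that can happen; it seeks to minimize the maximum risk. The following
theorem is sometimes useful in finding a minimax estimator.

Theorem 14 If $T^* = t^*(X_1, \ldots, X_n)$ is a Bayes estimator having
constant risk, that is, $\mathcal{R}_{t^*}(\theta) \equiv$ constant, then $T^*$ is a minimax estimator.

PROOF Let $g^*(\cdot)$ be the prior density corresponding to the Bayes
estimator $t^*(\cdot, \ldots, \cdot)$. Then

$$
\sup_{\theta \in \overline{\Theta}} \mathcal{R}_{t^*}(\theta) = \text{constant}
= \int_{\overline{\Theta}} \mathcal{R}_{t^*}(\theta)\, g^*(\theta)\, d\theta
\le \int_{\overline{\Theta}} \mathcal{R}_{t}(\theta)\, g^*(\theta)\, d\theta
\le \sup_{\theta \in \overline{\Theta}} \mathcal{R}_{t}(\theta)
$$

for any other estimator $t(\cdot, \ldots, \cdot)$. ////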


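EXAMPLE 46 Find the minimax estimator of $\theta$ in sampling from the Bernoulli
distribution using a squared-error loss function. We seek a Bayes
estimator with constant risk. The family of beta distributions is a family of
possible prior distributions. We hope that for one of the beta prior
distributions the corresponding Bayes estimator will have constant risk.
A Bayes estimator is given by

$$
\begin{aligned}
\frac{\int_{0}^{1} \theta \cdot \theta^{\sum x_i} (1-\theta)^{n - \sum x_i}\, [1/B(a,b)]\, \theta^{a-1} (1-\theta)^{b-1}\, d\theta}
     {\int_{0}^{1} \theta^{\sum x_i} (1-\theta)^{n - \sum x_i}\, [1/B(a,b)]\, \theta^{a-1} (1-\theta)^{b-1}\, d\theta}
&= \frac{\int_{0}^{1} \theta^{\sum x_i + a} (1-\theta)^{n - \sum x_i + b - 1}\, d\theta}
        {\int_{0}^{1} \theta^{\sum x_i + a - 1} (1-\theta)^{n - \sum x_i + b - 1}\, d\theta} \\
&= \frac{B(\sum x_i + a + 1,\; n - \sum x_i + b)}{B(\sum x_i + a,\; n - \sum x_i + b)}
 = \frac{\sum x_i + a}{n + a + b}.
\end{aligned}
$$

So the Bayes estimator with respect to a beta prior distribution having
parameters $a$ and $b$ is given by

$$
\frac{\sum X_i + a}{n + a + b}.
$$

We now evaluate the risk of $(\sum X_i + a)/(n + a + b)$ with the hope that we
will be able to select $a$ and $b$ so that the risk will be constant. Write
$t^*_{a,b}(x_1, \ldots, x_n) = A \sum x_i + B = (\sum x_i + a)/(n + a + b)$; then

$$
\begin{aligned}
\mathcal{R}_{t^*_{a,b}}(\theta)
&= \mathcal{E}_\theta\big[(A \textstyle\sum X_i + B - \theta)^2\big]
 = \mathcal{E}_\theta\big[\{A(\textstyle\sum X_i - n\theta) + B - \theta + nA\theta\}^2\big] \\
&= A^2\, \mathcal{E}_\theta\big[(\textstyle\sum X_i - n\theta)^2\big] + (B - \theta + nA\theta)^2
 = nA^2 \theta(1-\theta) + (B - \theta + nA\theta)^2 \\
&= \theta^2 [(nA - 1)^2 - nA^2] + \theta [nA^2 + 2(nA - 1)B] + B^2,
\end{aligned}
$$

which is constant if $(nA - 1)^2 - nA^2 = 0$ and $nA^2 + 2(nA - 1)B = 0$. Now
$(nA - 1)^2 - nA^2 = 0$ if $A = 1/[\sqrt{n}(\sqrt{n} \pm 1)]$, and $nA^2 + 2(nA - 1)B = 0$ if
$B = -nA^2/[2(nA - 1)]$, which is $1/[2(\sqrt{n} + 1)]$ for $A = 1/[\sqrt{n}(\sqrt{n} + 1)]$. On
solving for $a$ and $b$, we obtain $a = b = \sqrt{n}/2$; so
$(\sum X_i + \sqrt{n}/2)/(n + \sqrt{n})$
is a Bayes estimator with constant risk and, hence, minimax. ////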
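
The constant-risk claim of Example 46 can be verified exactly for a given $n$, because $\sum X_i$ has a binomial distribution and the risk is then a finite sum. The following sketch is only illustrative; the function names and the choice $n = 16$ are assumptions. It also evaluates the risk of the sample mean for comparison.

```python
import math


def risk(n, theta, estimator):
    """Exact squared-error risk E_theta[(t(S) - theta)^2] with S ~ Binomial(n, theta)."""
    return sum(
        math.comb(n, s) * theta**s * (1 - theta) ** (n - s) * (estimator(s) - theta) ** 2
        for s in range(n + 1)
    )


def minimax_estimator(n):
    """(S + sqrt(n)/2) / (n + sqrt(n)), the Bayes estimator for a = b = sqrt(n)/2."""
    r = math.sqrt(n)
    return lambda s: (s + r / 2) / (n + r)


def sample_mean(n):
    return lambda s: s / n


n = 16
for theta in (0.1, 0.5, 0.9):
    print(theta, risk(n, theta, minimax_estimator(n)), risk(n, theta, sample_mean(n)))
```

For $n = 16$ the risk of the minimax estimator equals $B^2 = 1/[4(\sqrt{n} + 1)^2] = 0.01$ at every $\theta$, whereas the risk of the sample mean, $\theta(1 - \theta)/n$, is smaller near 0 and 1 and larger near $\theta = 1/2$.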


8 VECTOR OF PARAMETERS 

In this section we present a brief introduction to the problem of simultaneous
point estimation of several functions of a vector parameter. We will assume
that a random sample $X_1, \ldots, X_n$ of size $n$ from the density $f(x; \theta_1, \ldots, \theta_k)$ is
available, where the parameter $\theta = (\theta_1, \ldots, \theta_k)$ and parameter space $\overline{\Theta}$ are
$k$-dimensional. We want to simultaneously estimate $\tau_1(\theta), \ldots, \tau_r(\theta)$, where
$\tau_j(\theta)$, $j = 1, \ldots, r$, is some function of $\theta = (\theta_1, \ldots, \theta_k)$. Often $k = r$, but this
need not be the case. An important special case is the estimation of
$\theta = (\theta_1, \ldots, \theta_k)$ itself; then $r = k$, and $\tau_1(\theta) = \theta_1, \ldots, \tau_k(\theta) = \theta_k$. Another important
special case is the estimation of $\tau(\theta)$; then $r = 1$. A point estimator of
$(\tau_1(\theta), \ldots, \tau_r(\theta))$ is a vector of statistics, say $(T_1, \ldots, T_r)$, where
$T_j = t_j(X_1, \ldots, X_n)$ and $T_j$ is an estimator of $\tau_j(\theta)$.

Our presentation of the method of moments and maximum-likelihood
method as techniques for finding estimators included the possibility that the
parameter be vector-valued. So we already have methods of determining
estimators. What we need are some criteria for assessing the goodness of an
estimator, say $(T_1, \ldots, T_r)$, and for comparing two estimators, say $(T_1, \ldots, T_r)$
and $(T_1', \ldots, T_r')$. As was the case in estimating a real-valued function $\tau(\theta)$,
where we wanted the values of our estimator to be close to $\tau(\theta)$, we now want
the values of the estimator $(T_1, \ldots, T_r)$ to be close to $(\tau_1(\theta), \ldots, \tau_r(\theta))$. We
want the distribution of $(T_1, \ldots, T_r)$ to be concentrated around
$(\tau_1(\theta), \ldots, \tau_r(\theta))$. There are a number of ways of measuring the closeness of an
estimator. For instance, in comparing two estimators the definitions of "more
concentrated" and "closer," given in Subsec. 3.1, can be generalized to $r$
dimensions. We will, however, restrict ourselves to consideration of unbiased
estimators and define several ways of measuring the closeness of an unbiased
estimator. No attempt will be made in this book to generalize to $r$ dimensions
the notions of loss and/or risk, invariance, Bayes estimation, and minimax. As
far as optimum estimation is concerned, we will be content to consider only
unbiased estimators and look for a best estimator within the restricted class of
unbiased estimators.

Definition 33 Unbiased An estimator $(T_1, \ldots, T_r)$, where
$T_j = t_j(X_1, \ldots, X_n)$, $j = 1, \ldots, r$, is defined to be an unbiased estimator of
$(\tau_1(\theta), \ldots, \tau_r(\theta))$ if and only if $\mathcal{E}_\theta[T_j] = \tau_j(\theta)$ for $j = 1, \ldots, r$ and for all
$\theta \in \overline{\Theta}$. ////

In Sec. 5, where we considered unbiased estimation of a real-valued
function $\tau(\theta)$, we employed the variance of an estimator as a measure of its closeness
to $\tau(\theta)$. Here we seek a generalization of the notion of variance to $r$ dimensions.
Several such generalizations have been proposed; we will consider four of them,
called (i) vector of variances, (ii) linear combination (with nonnegative
coefficients) of variances, (iii) ellipsoid of concentration, and (iv) Wilks' generalized
variance. The last two require some knowledge of matrices.

Possibly the simplest way of generalizing the concept of variance to
$r$ dimensions is to use the vector of variances of the unbiased estimators $T_1, \ldots, T_r$.
That is, let the vector $(\mathrm{var}_\theta[T_1], \ldots, \mathrm{var}_\theta[T_r])$ be a measure of the closeness of
the estimator $(T_1, \ldots, T_r)$ to $(\tau_1(\theta), \ldots, \tau_r(\theta))$. The disadvantage of such a
definition is that our measure is vector-valued and consequently sometimes
difficult to work with. One way of circumventing this disadvantage is to use a
linear combination of variances, that is, measure the closeness of the estimator
$(T_1, \ldots, T_r)$ to $(\tau_1(\theta), \ldots, \tau_r(\theta))$ with $\sum_{j=1}^{r} a_j \mathrm{var}_\theta[T_j]$ for suitably chosen $a_j \ge 0$.

Both of these generalizations of variance embody only the variances of the
$T_j$, $j = 1, \ldots, r$. The $T_j$ are likely to be correlated, so one might justifiably
think that our measure of closeness of $(T_1, \ldots, T_r)$ to $(\tau_1(\theta), \ldots, \tau_r(\theta))$ should
incorporate the covariances of the $T_j$'s.

Notation If $(T_1, \ldots, T_r)$ is an unbiased estimator of $(\tau_1(\theta), \ldots, \tau_r(\theta))$,
let $\sigma_{ij}(\theta) = \mathrm{cov}_\theta[T_i, T_j]$. The matrix whose $ij$th element is $\sigma_{ij}(\theta)$ is called
the covariance matrix of the estimator $(T_1, \ldots, T_r)$. Let $\sigma^{ij}(\theta)$ denote the
$ij$th element of the inverse of the covariance matrix. ////

Definition 34 Ellipsoid of concentration Let $(T_1, \ldots, T_r)$ be an
unbiased estimator of $(\tau_1(\theta), \ldots, \tau_r(\theta))$. Let $\sigma^{ij}(\theta)$ be the $ij$th element of the
inverse of the covariance matrix of $(T_1, \ldots, T_r)$, where the $ij$th element of
the covariance matrix is $\sigma_{ij}(\theta) = \mathrm{cov}_\theta[T_i, T_j]$. The ellipsoid of
concentration of $(T_1, \ldots, T_r)$ is defined as the interior and boundary of the ellipsoid

$$
\sum_{i=1}^{r} \sum_{j=1}^{r} \sigma^{ij}(\theta)\, [t_i - \tau_i(\theta)]\, [t_j - \tau_j(\theta)] = r + 2. \qquad (27)
$$

See Fig. 8 for $r = 2$. ////


Loosely speaking, the ellipsoid of concentration measures how
concentrated the distribution of $(T_1, \ldots, T_r)$ is about $(\tau_1(\theta), \ldots, \tau_r(\theta))$. [In fact, if one
considers the vector random variable, say $(U_1, \ldots, U_r)$, uniformly distributed
over the ellipsoid of concentration, it can be proved that $(U_1, \ldots, U_r)$ and
$(T_1, \ldots, T_r)$ have the same first- and second-order moments.] The distribution of
an estimator $(T_1, \ldots, T_r)$ whose ellipsoid of concentration is contained within
the ellipsoid of concentration of another estimator $(T_1', \ldots, T_r')$ is more highly
concentrated about $(\tau_1(\theta), \ldots, \tau_r(\theta))$ than is the distribution of $(T_1', \ldots, T_r')$.

It is known that the determinant of the covariance matrix of an
estimator is proportional to the square of the volume of the corresponding ellipsoid of
concentration; hence another generalization of variance is as in Definition 35.


Definition 35 Wilks' generalized variance Let $(T_1, \ldots, T_r)$ be an
unbiased estimator of $(\tau_1(\theta), \ldots, \tau_r(\theta))$. Wilks' generalized variance
of $(T_1, \ldots, T_r)$ is defined to be the determinant of the covariance matrix
of $(T_1, \ldots, T_r)$. ////
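
For $r = 2$ both Wilks' generalized variance and membership in the ellipsoid of concentration can be computed by hand from a $2 \times 2$ covariance matrix. The sketch below is a minimal illustration under assumed inputs: the covariance entries, the center $(\tau_1(\theta), \tau_2(\theta)) = (0, 0)$, and the function names are all hypothetical.

```python
def wilks_generalized_variance(cov):
    """Determinant of a 2x2 covariance matrix [[s11, s12], [s21, s22]]."""
    (s11, s12), (s21, s22) = cov
    return s11 * s22 - s12 * s21


def inverse_2x2(cov):
    """Inverse of a 2x2 matrix; its entries play the role of sigma^{ij}."""
    (s11, s12), (s21, s22) = cov
    det = wilks_generalized_variance(cov)
    return [[s22 / det, -s12 / det], [-s21 / det, s11 / det]]


def in_ellipsoid_of_concentration(point, center, cov):
    """True if `point` lies inside or on the ellipsoid of Eq. (27) with r = 2."""
    inv = inverse_2x2(cov)
    d = [point[0] - center[0], point[1] - center[1]]
    q = sum(inv[i][j] * d[i] * d[j] for i in range(2) for j in range(2))
    return q <= 2 + 2  # right-hand side of Eq. (27) is r + 2 with r = 2


# Hypothetical covariance matrix of an unbiased estimator (T_1, T_2).
cov = [[2.0, 0.5], [0.5, 1.0]]
center = (0.0, 0.0)
print(wilks_generalized_variance(cov))
print(in_ellipsoid_of_concentration((1.0, 1.0), center, cov))
print(in_ellipsoid_of_concentration((3.0, 0.0), center, cov))
```

A smaller determinant corresponds to a smaller ellipsoid, which is the sense in which the generalized variance measures concentration.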
Theorem 8, which showed how sufficiency could be used to improve on an
arbitrary unbiased estimator, generalizes to $r$ dimensions. The generalization
is stated without proof.

Theorem 15 Let $X_1, \ldots, X_n$ be a random sample from the density
$f(x; \theta_1, \ldots, \theta_k)$, and let $S_1 = s_1(X_1, \ldots, X_n), \ldots, S_m = s_m(X_1, \ldots, X_n)$ be
a set of jointly sufficient statistics. Let $(T_1, \ldots, T_r)$ be an unbiased
estimator of $(\tau_1(\theta), \ldots, \tau_r(\theta))$. Define $T_j' = \mathcal{E}[T_j \mid S_1, \ldots, S_m]$, $j = 1, \ldots, r$.
Then,

(i) $(T_1', \ldots, T_r')$ is a statistic and an unbiased estimator of
$(\tau_1(\theta), \ldots, \tau_r(\theta))$, and $T_j' = t_j'(S_1, \ldots, S_m)$; that is, $T_j'$ is a function of the
sufficient statistics $S_1, \ldots, S_m$, $j = 1, \ldots, r$.

(ii) $\mathrm{var}_\theta[T_j'] \le \mathrm{var}_\theta[T_j]$ for every $\theta \in \overline{\Theta}$, $j = 1, \ldots, r$.

(iii) The ellipsoid of concentration of $(T_1', \ldots, T_r')$ is contained in
the ellipsoid of concentration of $(T_1, \ldots, T_r)$, for every $\theta \in \overline{\Theta}$. ////

We might note that (ii) implies that

$$
\sum_{j=1}^{r} a_j \mathrm{var}_\theta[T_j'] \le \sum_{j=1}^{r} a_j \mathrm{var}_\theta[T_j] \quad \text{for } a_j \ge 0,
$$

and (iii) implies that Wilks' generalized variance of $(T_1', \ldots, T_r')$ is no larger than
Wilks' generalized variance of $(T_1, \ldots, T_r)$.

Theorem 10 of Sec. 5 can also be generalized to $r$ dimensions, but first the
concept of completeness has to be generalized.

Definition 36 Joint completeness For $X_1, \ldots, X_n$, a random sample
from the density $f(x; \theta_1, \ldots, \theta_k)$, let $(T_1, \ldots, T_m)$ be a set of
statistics. $T_1, \ldots, T_m$ are defined to be jointly complete if and only if
$\mathcal{E}_\theta[z(T_1, \ldots, T_m)] \equiv 0$ for all $\theta \in \overline{\Theta}$ implies that $P_\theta[z(T_1, \ldots, T_m) = 0] \equiv 1$
for all $\theta \in \overline{\Theta}$, where $z(T_1, \ldots, T_m)$ is a statistic. ////


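EXAMPLE 47 Let $X_1, \ldots, X_n$ be a random sample from
$f(x; \theta_1, \theta_2) = [1/(\theta_2 - \theta_1)]\, I_{(\theta_1, \theta_2)}(x)$,
where $\theta_1 < \theta_2$. Write $\theta = (\theta_1, \theta_2)$. Let $Y_1 = \min[X_1, \ldots, X_n]$ and
$Y_n = \max[X_1, \ldots, X_n]$. We want to show that $Y_1$ and $Y_n$ are jointly
complete. (We know that they are jointly sufficient.) Let $z(Y_1, Y_n)$ be
an unbiased estimator of 0, that is,

$$
\mathcal{E}_\theta[z(Y_1, Y_n)] \equiv 0 \quad \text{for all } \theta \in \overline{\Theta}.
$$

Now

$$
\begin{aligned}
\mathcal{E}_\theta[z(Y_1, Y_n)]
&= \iint z(y_1, y_n)\, f_{Y_1, Y_n}(y_1, y_n)\, dy_1\, dy_n \\
&= \int_{\theta_1}^{\theta_2} \int_{\theta_1}^{y_n} z(y_1, y_n)\,
   \frac{n(n-1)}{(\theta_2 - \theta_1)^{n}}\, (y_n - y_1)^{n-2}\, dy_1\, dy_n ,
\end{aligned}
$$

which is identically 0 if and only if

$$
\int_{\theta_1}^{\theta_2} \int_{\theta_1}^{y_n} z(y_1, y_n)\, (y_n - y_1)^{n-2}\, dy_1\, dy_n \equiv 0 \quad \text{for } \theta_1 < \theta_2 .
$$

Differentiate both sides with respect to $\theta_2$, and obtain

$$
\int_{\theta_1}^{\theta_2} z(y_1, \theta_2)\, (\theta_2 - y_1)^{n-2}\, dy_1 \equiv 0 \quad \text{for all } \theta_1 < \theta_2 ;
$$

now differentiate both sides of the resulting identity with respect to
$\theta_1$, and obtain $-z(\theta_1, \theta_2)(\theta_2 - \theta_1)^{n-2} \equiv 0$ for all $\theta_1 < \theta_2$, and hence
$z(\theta_1, \theta_2) = 0$ for $\theta_1 < \theta_2$; that is, $z(y_1, y_n) = 0$ for $y_1 < y_n$, where $y_1$
and $y_n$ are the possible values of $Y_1$ and $Y_n$. We have shown that $Y_1$ and
$Y_n$ are jointly complete. ////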

If the density $f(x; \theta_1, \ldots, \theta_k)$ is a member of the $k$-parameter exponential
family, a set of jointly complete and sufficient statistics can be found using the
following theorem. It is a $k$-dimensional analog of Theorem 9 and is stated
without proof. The following theorem is not precisely stated; certain regularity
conditions are omitted [16].

Theorem 16 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta_1, \ldots, \theta_k)$.
If

$$
f(x; \theta_1, \ldots, \theta_k) = a(\theta_1, \ldots, \theta_k)\, b(x) \exp\left[ \sum_{j=1}^{k} c_j(\theta_1, \ldots, \theta_k)\, d_j(x) \right],
$$

that is, $f(x; \theta_1, \ldots, \theta_k)$ is a member of the $k$-parameter exponential family, then
$\left( \sum_{i=1}^{n} d_1(X_i), \ldots, \sum_{i=1}^{n} d_k(X_i) \right)$ is a minimal set of jointly complete and sufficient
statistics. ////
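EXAMPLE 48 Let $X_1, \ldots, X_n$ be a random sample from

$$
f(x; \theta_1, \theta_2) = \phi_{\theta_1, \theta_2^2}(x) = \frac{1}{\sqrt{2\pi}\, \theta_2} \exp\left[ -\frac12 \left( \frac{x - \theta_1}{\theta_2} \right)^2 \right].
$$

Now

$$
f(x; \theta_1, \theta_2) = \frac{1}{\sqrt{2\pi}\, \theta_2} \exp\left( -\frac12 \frac{\theta_1^2}{\theta_2^2} \right)
\exp\left( -\frac{1}{2\theta_2^2}\, x^2 + \frac{\theta_1}{\theta_2^2}\, x \right);
$$

so $\sum X_i$ and $\sum X_i^2$ are jointly complete and sufficient statistics by
Theorem 16. ////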


We will state without proof the vector analog of Theorem 10. In the
same sense that an UMVUE was optimum, the following theorem gives an
optimum estimator for a vector of functions of the parameter.

Theorem 17 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta_1, \ldots, \theta_k)$.
Write $\theta = (\theta_1, \ldots, \theta_k)$. If $S_1 = s_1(X_1, \ldots, X_n), \ldots, S_m = s_m(X_1, \ldots, X_n)$
is a set of jointly complete sufficient statistics and if there exists an
unbiased estimator of $(\tau_1(\theta), \ldots, \tau_r(\theta))$, then there exists a unique unbiased
estimator of $(\tau_1(\theta), \ldots, \tau_r(\theta))$, say $T_1^* = t_1^*(S_1, \ldots, S_m), \ldots, T_r^* =
t_r^*(S_1, \ldots, S_m)$, where each $t_j^*$ is a function of $S_1, \ldots, S_m$, which satisfies:

(i) $\mathrm{var}_\theta[T_j^*] \le \mathrm{var}_\theta[T_j]$ for every $\theta \in \overline{\Theta}$, $j = 1, \ldots, r$, for any
unbiased estimator $(T_1, \ldots, T_r)$ of $(\tau_1(\theta), \ldots, \tau_r(\theta))$.

(ii) The ellipsoid of concentration of $(T_1^*, \ldots, T_r^*)$ is contained in
the ellipsoid of concentration of $(T_1, \ldots, T_r)$, where $(T_1, \ldots, T_r)$ is any
unbiased estimator of $(\tau_1(\theta), \ldots, \tau_r(\theta))$. ////


There are four different maximal subscripts, all of which are intended:
$n$ denotes the sample size, $k$ denotes the dimension of the parameter $\theta$, $m$ is the
number of real-valued statistics in our jointly complete and sufficient set, and $r$
is the dimension of the vector of functions of the parameter that we are trying
to estimate. In practice, it will turn out that usually $k = m$. The estimator
$(T_1^*, \ldots, T_r^*)$ is optimal in the sense that among unbiased estimators it is the
best estimator using any of the four generalizations of variance that have been
proposed.

Just as was the case in using Theorem 10, we have two ways of finding
$(T_1^*, \ldots, T_r^*)$. The first is to guess the correct form of the functions $t_1^*, \ldots, t_r^*$,
which are functions of $S_1, \ldots, S_m$, that will make them unbiased estimators of
$\tau_1(\theta), \ldots, \tau_r(\theta)$. The second is to find any set of unbiased estimators of
$\tau_1(\theta), \ldots, \tau_r(\theta)$ and then calculate the conditional expectation of these unbiased
estimators given the set of jointly complete and sufficient statistics. We employ
only the first method in the following examples.


EXAMPLE 49 Let $X_1, \ldots, X_n$ be a random sample from the density
$f(x; \theta_1, \theta_2) = [1/(\theta_2 - \theta_1)]\, I_{(\theta_1, \theta_2)}(x)$. Suppose we want to jointly estimate
the range and midrange, that is, $\tau_1(\theta) = \theta_2 - \theta_1$ and $\tau_2(\theta) = (\theta_1 + \theta_2)/2$.
We know that $Y_1 = \min[X_1, \ldots, X_n]$ and $Y_n = \max[X_1, \ldots, X_n]$ are
jointly sufficient (see Example 23); also, they are jointly complete (see
Example 47). Hence, to find the unbiased estimator $(T_1^*, T_2^*)$ which has
uniformly smallest variance for each component among all unbiased
estimators, it suffices to find the unbiased estimator that is a function of
the jointly complete sufficient statistics. Since
$\mathcal{E}[Y_1] = \theta_1 + (\theta_2 - \theta_1)/(n + 1)$
and $\mathcal{E}[Y_n] = \theta_2 - (\theta_2 - \theta_1)/(n + 1)$,
$([(n + 1)/(n - 1)](Y_n - Y_1),\ (Y_1 + Y_n)/2)$
is the unbiased estimator of $(\theta_2 - \theta_1, (\theta_1 + \theta_2)/2)$ that we are seeking. ////
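
The unbiasedness asserted in Example 49 can be checked roughly by simulation. The sketch below is illustrative only: the parameter values $\theta_1 = 2$, $\theta_2 = 5$, the sample size, the number of replications, the seed, and the function name are assumptions, and a Monte Carlo average only approximates an expectation.

```python
import random


def simulate_range_midrange(theta1, theta2, n, reps=20000, seed=0):
    """Average the estimators of Example 49 over `reps` simulated uniform samples."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total_range = total_mid = 0.0
    for _ in range(reps):
        xs = [rng.uniform(theta1, theta2) for _ in range(n)]
        y1, yn = min(xs), max(xs)
        total_range += (n + 1) / (n - 1) * (yn - y1)  # estimator of theta2 - theta1
        total_mid += (y1 + yn) / 2                    # estimator of (theta1 + theta2)/2
    return total_range / reps, total_mid / reps


avg_range, avg_mid = simulate_range_midrange(2.0, 5.0, n=5)
print(avg_range, avg_mid)  # should be near 3.0 and 3.5, up to simulation error
```

The averages should fall close to the true range and midrange; the uncorrected difference $Y_n - Y_1$ would instead average about $(n - 1)/(n + 1)$ times the range.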


EXAMPLE 50 Let $X_1, \ldots, X_n$ be a random sample from the normal
density $f(x; \theta_1, \theta_2) = \phi_{\mu, \sigma^2}(x)$. By Examples 22 and 48, $\sum X_i$ and $\sum X_i^2$ are
jointly complete and sufficient statistics. Hence, by Theorem 17,
$(\sum X_i/n,\ \sum (X_i - \bar{X})^2/(n - 1))$ is an unbiased estimator of $(\mu, \sigma^2)$ whose
corresponding ellipsoid of concentration is contained in the ellipsoid of
concentration of any other unbiased estimator. [NOTE: $\sum (X_i - \bar{X})^2 =
\sum X_i^2 - n\bar{X}^2$; so the estimator $\sum (X_i - \bar{X})^2/(n - 1)$ is a function of the
jointly complete and sufficient statistics $\sum X_i$ and $\sum X_i^2$.]

For this same example, suppose we want to estimate that function
of $\theta = (\mu, \sigma^2)$ satisfying the following integral equation:

$$
\int_{\tau(\theta)}^{\infty} \phi_{\mu, \sigma^2}(x)\, dx = \alpha
$$

for $\alpha$ fixed and known. $\tau(\theta)$ is that point which satisfies $P[X_i > \tau(\theta)] = \alpha$;
that is, it is that point which has $100\alpha$ percent of the mass of the population
density to its right, or $\tau(\theta)$ is the $(1 - \alpha)$th quantile point. We have
$1 - \alpha = \Phi([\tau(\theta) - \mu]/\sigma)$; so $\tau(\theta) = \mu + z_{1-\alpha}\sigma$, where $z_{1-\alpha}$ is given by
$\Phi(z_{1-\alpha}) = 1 - \alpha$. Since $\alpha$ is known, $z_{1-\alpha}$ can be obtained from a table
of the standard normal distribution. To find the UMVUE of $\tau(\theta)$, it
suffices to find the unbiased estimator of $\mu + z_{1-\alpha}\sigma$ which is a function of
$\sum X_i$ and $\sum X_i^2$. We know that $\bar{X}$ is the UMVUE of $\mu$, and it can be
verified that

$$
\frac{\Gamma[(n-1)/2]}{\Gamma(n/2)\sqrt{2}} \sqrt{\sum (X_i - \bar{X})^2} = T^*,
$$

say, is the UMVUE of $\sigma$; hence $\bar{X} + z_{1-\alpha} T^*$ is the UMVUE of $\tau(\theta)$. We
have employed Theorem 17 for $r = 1$; our vector of functions of the
parameter that we wanted to estimate was unidimensional. ////
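
The quantile estimator of Example 50 is easy to compute. The following is a minimal sketch under stated assumptions: the function names and the data are hypothetical, and the table lookup of $z_{1-\alpha}$ is replaced by `statistics.NormalDist().inv_cdf` from the Python standard library.

```python
import math
from statistics import NormalDist


def sigma_umvue(xs):
    """T* of Example 50: Gamma((n-1)/2) / (Gamma(n/2) * sqrt(2)) * sqrt(sum (x_i - xbar)^2)."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    c = math.gamma((n - 1) / 2) / (math.gamma(n / 2) * math.sqrt(2))
    return c * math.sqrt(ss)


def upper_quantile_umvue(xs, alpha):
    """Estimate of the point with probability alpha to its right: xbar + z_{1-alpha} * T*."""
    z = NormalDist().inv_cdf(1 - alpha)  # z_{1-alpha}, in place of a normal table
    xbar = sum(xs) / len(xs)
    return xbar + z * sigma_umvue(xs)


# Hypothetical measurements, for illustration only.
data = [9.8, 10.4, 10.1, 9.5, 10.7, 10.0]
print(sigma_umvue(data), upper_quantile_umvue(data, alpha=0.05))
```

As a sanity check on the constant, for $n = 2$ it reduces to $\sqrt{\pi/2}$, and for $\alpha = 1/2$ the estimator reduces to the sample mean because $z_{1/2} = 0$.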


9 OPTIMUM PROPERTIES OF 

MAXIMUM-LIKELIHOOD ESTIMATION 

Several methods of finding point estimators were presented in Sec. 2 of this
chapter. There, and in succeeding sections, we have particularly emphasized
the method of maximum likelihood. In this section we will partially justify
such emphasis by considering some optimum properties of maximum-likelihood
estimators.

For simplicity of presentation, let us consider the maximum-likelihood
estimation of the parameter $\theta$, which is to be estimated on the basis of a random
sample from a density $f(\cdot\,; \theta)$, where $\theta$ is assumed to be a real number. That is,
let us consider the unidimensional-parameter case and estimate $\theta$ itself. Recall
that for the observed sample $x_1, \ldots, x_n$ the maximum-likelihood estimate of $\theta$
is that value, say $\hat{\theta}$, of $\theta$ which maximizes the likelihood function
$L(\theta; x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i; \theta)$. Let $\hat{\Theta}_n = \hat{\vartheta}_n(X_1, \ldots, X_n)$ denote the
maximum-likelihood estimator of $\theta$ based on a sample of size $n$. We defined and
discussed in Sec. 3 of this chapter a number of properties that an estimator may
or may not possess. Recall that some of these properties, such as unbiasedness
and uniformly minimum variance, are referred to as small-sample properties, and
others of these properties, such as consistency and best asymptotically normal,
are referred to as large-sample properties. The use of the word "small" in
"small-sample" is somewhat misleading since a small-sample property is really a
property that is defined for a fixed sample size, which may be fixed to be either
small or large. By a large-sample property we mean a property that is defined in
terms of the sample size increasing to infinity. Our main result of this section
will be contained in Theorem 18 below and will concern optimum large-sample
properties of maximum-likelihood estimation.

We have already observed some small-sample properties of maximum-likelihood
estimation. For instance, we have noted two things: first, that some
maximum-likelihood estimators are unbiased and others are not and, second,
that some maximum-likelihood estimators are uniformly minimum-variance
unbiased and others are not. For example, in the density $f(x; \theta) = \phi_{\theta, 1}(x)$ the
maximum-likelihood estimator of $\theta$ is $\bar{X}$, which is the uniformly minimum-variance
unbiased estimator of $\theta$, whereas in the density $f(x; \theta) = (1/\theta) I_{[0, \theta]}(x)$
the maximum-likelihood estimator of $\theta$ is $Y_n = \max[X_1, \ldots, X_n]$, which is
biased. [We might note here that the $Y_n$ in this last example can be corrected
for bias by multiplying $Y_n$ by $(n + 1)/n$ and that the estimator that is thus
obtained is uniformly minimum-variance unbiased.]

One property that it seems reasonable to expect of a sequence of
estimators is that of consistency. Theorem 18 will show, in particular, that generally
a sequence of maximum-likelihood estimators is consistent.

Theorem 18 If the density $f(x; \theta)$ satisfies certain regularity conditions
and if $\hat{\Theta}_n = \hat{\vartheta}_n(X_1, \ldots, X_n)$ is the maximum-likelihood estimator of $\theta$ for
a random sample of size $n$ from $f(x; \theta)$, then:

(i) $\hat{\Theta}_n$ is asymptotically normally distributed with mean $\theta$ and
variance $1 \Big/ n\, \mathcal{E}_\theta\left[ \left\{ \frac{\partial}{\partial\theta} \log f(X; \theta) \right\}^2 \right]$.

(ii) The sequence of maximum-likelihood estimators $\hat{\Theta}_1, \ldots, \hat{\Theta}_n, \ldots$
is best asymptotically normal (BAN). ////

We will not be able to prove Theorem 18. In fact, we have not precisely
stated it, inasmuch as we have not delineated the regularity conditions. We
do, however, want to emphasize what the theorem says. Loosely speaking, it
says that for large sample size the maximum-likelihood estimator of $\theta$ is as
good an estimator as there is. (Other estimators might be just as good but not
better.)

We might point out one feature of the theorem, namely, that the asymptotic
normal distribution of the maximum-likelihood estimator is not given in terms of
the distribution of the maximum-likelihood estimator. It is given in terms of
$f(\cdot\,; \theta)$, the density sampled. Also, the variance of the asymptotic normal
distribution given in the theorem is the Cramér-Rao lower bound.


EXAMPLE 51 Let $X_1, \ldots, X_n$ be a random sample from the negative
exponential distribution $f(x; \theta) = \theta e^{-\theta x} I_{(0, \infty)}(x)$. It can be routinely
demonstrated that the maximum-likelihood estimator of $\theta$ is given by
$n/\sum X_i = 1/\bar{X}_n$. According to Theorem 18 above, the maximum-likelihood
estimator has an asymptotic normal distribution with mean $\theta$ and
variance equal to

$$
\frac{1}{n\, \mathcal{E}_\theta\left[ \left\{ \frac{\partial}{\partial\theta} \log f(X; \theta) \right\}^2 \right]} = \frac{\theta^2}{n}.
$$

(See Example 29.) ////
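
The variance in Example 51 can be checked by computing the expectation in its denominator numerically. For the negative exponential density, $\frac{\partial}{\partial\theta} \log f(x; \theta) = 1/\theta - x$, so the expectation is an ordinary integral. The sketch below is illustrative: the function names, the truncation point of the integral, and the parameter values are assumptions, and a midpoint rule is used in place of exact integration.

```python
import math


def fisher_information_exponential(theta, steps=200000):
    """E[(d/dtheta log f(X; theta))^2] for f(x; theta) = theta * exp(-theta * x), x > 0.

    Midpoint rule on (0, 50/theta); the neglected tail mass is of order exp(-50).
    """
    upper = 50.0 / theta
    h = upper / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        score = 1.0 / theta - x  # derivative of log f with respect to theta
        total += score**2 * theta * math.exp(-theta * x) * h
    return total


def asymptotic_variance(theta, n):
    """Variance of the asymptotic normal distribution in Theorem 18(i)."""
    return 1.0 / (n * fisher_information_exponential(theta))


theta, n = 2.0, 100  # hypothetical parameter value and sample size
print(fisher_information_exponential(theta), asymptotic_variance(theta, n))
```

For $\theta = 2$ the expectation is approximately $1/\theta^2 = 0.25$, giving an asymptotic variance of about $\theta^2/n = 0.04$ when $n = 100$.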

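We have ordinarily considered estimation of $\tau(\theta)$, some function of $\theta$,
rather than estimation of $\theta$ itself. For maximum-likelihood estimation, we
noted (see Theorem 2) that the maximum-likelihood estimator of $\tau(\theta)$ was given
by $\tau(\hat{\Theta})$, where $\hat{\Theta}$ was the maximum-likelihood estimator of $\theta$. If we assume
that $\tau(\cdot)$ is differentiable, then it can be shown that $\tau(\hat{\Theta}_n)$ has an asymptotic
normal distribution with mean $\tau(\theta)$ and variance

$$
\frac{[\tau'(\theta)]^2}{n\, \mathcal{E}_\theta\left[ \left\{ \frac{\partial}{\partial\theta} \log f(X; \theta) \right\}^2 \right]},
$$

which is the Cramér-Rao lower bound. (See Theorem 7.)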

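Maximum-likelihood estimators possess similar optimum large-sample
properties in the case of a $k$-dimensional parameter. For instance, it can be
proved (again under regularity conditions) that the joint distribution of the
maximum-likelihood estimators is asymptotically distributed as a multivariate
normal distribution. Let us illustrate for the case when $k = 2$, that is,
$\theta = (\theta_1, \theta_2)$. Recall that the bivariate normal distribution is specified by the
five parameters $\mu_1$, $\mu_2$, $\sigma_1^2$, $\sigma_2^2$, and $\rho$. (See Sec. 5 of Chap. IV.) It turns out
that under certain regularity conditions the joint distribution of the
maximum-likelihood estimators $\hat{\Theta}_1$ and $\hat{\Theta}_2$ is asymptotically distributed as a bivariate
normal distribution with parameters $\mu_1 = \theta_1$, $\mu_2 = \theta_2$,

$$
\sigma_1^2 = -\frac{1}{n\Delta}\, \mathcal{E}_\theta\left[ \frac{\partial^2}{\partial\theta_2^2} \log f(X; \theta_1, \theta_2) \right],
\qquad
\sigma_2^2 = -\frac{1}{n\Delta}\, \mathcal{E}_\theta\left[ \frac{\partial^2}{\partial\theta_1^2} \log f(X; \theta_1, \theta_2) \right],
$$

and

$$
\rho\, \sigma_1 \sigma_2 = \frac{1}{n\Delta}\, \mathcal{E}_\theta\left[ \frac{\partial^2}{\partial\theta_1\, \partial\theta_2} \log f(X; \theta_1, \theta_2) \right],
$$

where

$$
\Delta = \mathcal{E}_\theta\left[ \frac{\partial^2}{\partial\theta_1^2} \log f(X; \theta_1, \theta_2) \right]
         \mathcal{E}_\theta\left[ \frac{\partial^2}{\partial\theta_2^2} \log f(X; \theta_1, \theta_2) \right]
       - \left\{ \mathcal{E}_\theta\left[ \frac{\partial^2}{\partial\theta_1\, \partial\theta_2} \log f(X; \theta_1, \theta_2) \right] \right\}^2 .
$$
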

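EXAMPLE 52 Let $X_1, \ldots, X_n$ be a random sample from the density

$$
f(x; \theta) = f(x; \theta_1, \theta_2) = \phi_{\theta_1, \theta_2}(x) = \frac{1}{\sqrt{2\pi\theta_2}} \exp\left[ -\frac{(x - \theta_1)^2}{2\theta_2} \right].
$$

We have already derived, in Example 6, the maximum-likelihood estimators
of $\theta_1$ and $\theta_2$; they are, respectively,

$$
\hat{\Theta}_1 = \frac{1}{n} \sum_{i=1}^{n} X_i \quad \text{and} \quad \hat{\Theta}_2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\Theta}_1)^2 .
$$

According to the above, the asymptotic large-sample joint distribution of
$\hat{\Theta}_1$ and $\hat{\Theta}_2$ is a bivariate normal distribution with means $\theta_1$ and $\theta_2$.
Since $\log f(X; \theta) = -\tfrac12 \log 2\pi - \tfrac12 \log \theta_2 - (1/2\theta_2)(X - \theta_1)^2$, the required
derivatives are

$$
\frac{\partial^2}{\partial\theta_1^2} \log f(X; \theta) = -\frac{1}{\theta_2},
\qquad
\frac{\partial^2}{\partial\theta_1\, \partial\theta_2} \log f(X; \theta) = -\frac{X - \theta_1}{\theta_2^2},
$$

and

$$
\frac{\partial^2}{\partial\theta_2^2} \log f(X; \theta) = \frac{1}{2\theta_2^2} - \frac{(X - \theta_1)^2}{\theta_2^3};
$$

and because $\mathcal{E}_\theta[X] = \theta_1$ and $\mathcal{E}_\theta[(X - \theta_1)^2] = \theta_2$,

$$
\mathcal{E}_\theta\left[ \frac{\partial^2}{\partial\theta_1^2} \log f(X; \theta) \right] = -\frac{1}{\theta_2},
\qquad
\mathcal{E}_\theta\left[ \frac{\partial^2}{\partial\theta_1\, \partial\theta_2} \log f(X; \theta) \right] = 0,
$$

and

$$
\mathcal{E}_\theta\left[ \frac{\partial^2}{\partial\theta_2^2} \log f(X; \theta) \right] = -\frac{1}{2\theta_2^2},
$$

which gives $\Delta = 1/2\theta_2^3$.
Finally, then, $\sigma_1^2 = \theta_2/n$, $\sigma_2^2 = 2\theta_2^2/n$, and $\rho = 0$.

////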
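
The arithmetic of Example 52 can be reproduced in a few lines. The sketch below assumes the standard asymptotic relations $\sigma_1^2 = -E_{22}/(n\Delta)$, $\sigma_2^2 = -E_{11}/(n\Delta)$, $\rho\sigma_1\sigma_2 = E_{12}/(n\Delta)$ with $\Delta = E_{11}E_{22} - E_{12}^2$, where $E_{11}$, $E_{22}$, $E_{12}$ denote the expected second partial derivatives of $\log f$; the function names and numerical inputs are hypothetical.

```python
def asymptotic_bivariate_params(e11, e22, e12, n):
    """Asymptotic variances and correlation of the two maximum-likelihood estimators.

    e11, e22, e12 are the expected second partial derivatives of log f with
    respect to (theta1, theta1), (theta2, theta2), and (theta1, theta2).
    """
    delta = e11 * e22 - e12**2
    var1 = -e22 / (n * delta)
    var2 = -e11 / (n * delta)
    cov = e12 / (n * delta)
    rho = cov / (var1 * var2) ** 0.5
    return var1, var2, rho


def normal_case(theta2, n):
    """Example 52: normal density with mean theta1 and variance theta2."""
    return asymptotic_bivariate_params(-1 / theta2, -1 / (2 * theta2**2), 0.0, n)


# Hypothetical variance theta2 = 4 and sample size n = 50.
print(normal_case(theta2=4.0, n=50))
```

With $\theta_2 = 4$ and $n = 50$ this returns $\sigma_1^2 = \theta_2/n = 0.08$, $\sigma_2^2 = 2\theta_2^2/n = 0.64$, and $\rho = 0$, in agreement with the example.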


PROBLEMS 

1 An um contains black and white balls A sample of size nn drawn with replace* 
tnent What is the maximum likelihood estimator of the ratio R of black to 
white balls in the urn? Suppose that one draws balls one by one with replacement 
until a black ball appears Let X be the number of draws required (not counting 
the last draw) This operation is repeated n times to obtain a sample A'j, Xt, , 
X g What is the maximum likelihood estimator of R on the basis of this sample? 

2 Suppose that n cylindrical shafts made by a machine aie selected at random from 
the production of the machine and their diameters and lengths measured U Is 
found that N n have both measurements within the tolerance limits, Nn have 
satisfactory lengths but unsatisfactory diameters, Nn have satisfactory diameters 
but unsatisfactory lengths and N lt are unsatisfactory as to both measurements 
2 Nij = n Each shaft may be regarded as a drawing from a multinomial popula- 
tion with density 

PnPVtPiV^ -~Pit —/’I* -Pn)’ 11 for Xu *=0, 1, 2 Xu = 1 
having three parameters What are the maximum likelihood estimates of the 
parameters if Nn *® 90, Nn = 6 Nn “ 3, and Nn ***17 

3 Referring to Prob 2 suppose that there is no reason to believe that defective 
diameters can in any way be related do defective lengths Then the distribution 
of the Xtj can be set up in terms of two parameters pi, the probability of a satis- 
factory length and q lt the probability of a satisfactory diameter The density of 
the X„ is then 

0pi?i)* i, tPi(i -/>.)9.r ,, Ki -R.XI -<r.)r M 

for xu = 0,1, 2 * 0 *= I 

What arc the maximum likelihood estimates for these parameters? Are the prob- 
abilities for the four classes different under this model from those obtained in the 
above problem? 

4 A sample of size n, is to be drawn from a normal population with mean and 
variance a[ A second sample of size w* is to be drawn from a normal population 
with mean p, and variance o) What is the maximum likelihood estimator of 
& —pi— pt ’ If we assume that the total sample stze n = ffj + n» is fixed how 
should the n observations be divided between the two populations in order to 
minimize the variance of the maximum likelihood estimator of 0? 
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5 A sample of size n is drawn from each of four normal populations, all of which 
have the same variance a 2 . The means of the four populations are a + b + c, 
a + b— c, a — 6 + c, and a — b — c. What are the maximum-likelihood estima- 
tors of a , b , c, and cr 2 ? (The sample observations may be denoted by X tJ i = 1 
2, 3, 4 andy = 1, 2, 

6 Observations Xu X 2 , . . . , X„ are drawn from normal populations with the same 
mean y but with different variances <r], o \ , . . . , a 2 . Is it possible to estimate all 
the parameters? If we assume that the c 2 are known, what is the maximum- 
likelihood estimator of yl 

7 The radius of a circle is measured with an error of measurement which is
distributed $N(0, \sigma^2)$, $\sigma^2$ unknown. Given $n$ independent measurements of the
radius, find an unbiased estimator of the area of the circle.

8 Let $X$ be a single observation from the Bernoulli density
$f(x; \theta) = \theta^x (1 - \theta)^{1-x} I_{\{0,1\}}(x)$, where $0 < \theta < 1$. Let $t_1(X) = X$ and $t_2(X) =$

(a) Are both $t_1(X)$ and $t_2(X)$ unbiased? Is either?

(b) Compare the mean-squared error of $t_1(X)$ with that of $t_2(X)$.

9 Let $X_1, X_2$ be a random sample of size 2 from the Cauchy density

$$f(x; \theta) = \frac{1}{\pi [1 + (x - \theta)^2]}, \quad -\infty < \theta < \infty.$$

Argue that $(X_1 + X_2)/2$ is a Pitman closer estimator of $\theta$ than $X_1$ is. [Note that
$(X_1 + X_2)/2$ is not more concentrated than $X_1$ since they have identical
distributions.]

10 Let $\theta$ denote some physical quantity, and let $X_1, \ldots, X_n$ denote $n$ measurements of
the physical quantity. If $\theta$ is estimated by $\hat\theta$, then the residual of the $i$th
measurement is defined by $X_i - \hat\theta$, $i = 1, \ldots, n$. Show that there is only one estimator
with the property that the residuals sum to 0, and find that estimator. Also,
find that estimator which minimizes the sum of squared residuals.

11 Let $X_1, \ldots, X_n$ be a random sample from some density which has mean $\mu$ and
variance $\sigma^2$.

(a) Show that $\sum_{i=1}^{n} a_i X_i$ is an unbiased estimator of $\mu$ for any set of known
constants $a_1, \ldots, a_n$ satisfying $\sum_{i=1}^{n} a_i = 1$.

(b) If $\sum a_i = 1$, show that $\mathrm{var}[\sum a_i X_i]$ is minimized for $a_i = 1/n$, $i = 1, \ldots, n$.
[Hint: Prove that $\sum a_i^2 = \sum (a_i - 1/n)^2 + 1/n$ when $\sum a_i = 1$.]


12 Let $X_1, \ldots, X_n$ be a random sample from the discrete density function $f(x; \theta) =$
where $0 \le \theta \le \tfrac{1}{2}$. Note that $\Theta = \{\theta : 0 \le \theta \le \tfrac{1}{2}\}$.

(a) Find a method-of-moments estimator of $\theta$, and then find the mean and mean-
squared error of your estimator.

(b) Find a maximum-likelihood estimator of $\theta$, and then find the mean and
mean-squared error of your estimator.

13 Let $X_1, X_2$ be a random sample of size 2 from a normal distribution with mean $\theta$
and variance 1. Consider the following three estimators of $\theta$:

T t = A(X l ,X 1 ) = W l + iX 2 
T 2 = t 2 {X u X 2 ) = iX l +\X 2 
i,X 2 ) = iX 2 +lX 2 


(a) For the loss function $\ell(t; \theta) = 3\theta^2 (t - \theta)^2$, find $\mathcal{R}_{t_i}(\theta)$ for $i = 1, 2, 3$, and
sketch it.

(b) Show that $T_i$ is unbiased for $i = 1, 2, 3$.

14 Let $X_1, \ldots, X_n$ be independent and identically distributed random variables
from some distribution for which the first four central moments exist. We know
that $\mathcal{E}[S^2] = \sigma^2$ and

where 


Is $S^2$ a mean-squared-error consistent estimator of $\sigma^2$?

15 In genetic investigations one frequently samples from a binomial distribution
$f(x) = \binom{m}{x} p^x q^{m-x}$ except that observations of $x = 0$ are impossible; so, in fact, the
sampling is from the conditional (truncated) distribution




Find the maximum likelihood estimator of $p$ in the case $m = 2$ for samples of
size $n$. Is the estimator unbiased?

16 Let $X$ be a single observation from $N(0, \theta)$ ($\theta = \sigma^2$).

(a) Is $X$ a sufficient statistic?

(b) Is $|X|$ a sufficient statistic?

(c) Is $X^2$ an unbiased estimator of $\theta$?

(d) What is a maximum likelihood estimator of $\sqrt{\theta}$?

(e) What is a method-of-moments estimator of $\sqrt{\theta}$?

17 Let X have the density /(*,© «= - $)' „ „(*), 

t>efine^x)=*2I m W 

(a) Is $X$ a sufficient statistic? A complete statistic?

(b) Is $|X|$ a sufficient statistic? A complete statistic?

(c) What is a maximum likelihood estimator of $\theta$?

(d) Is $T = t(X)$ an unbiased estimator of $\theta$?

(e) Does $f(x; \theta)$ belong to an exponential class?

(f) Find an estimator with uniformly smaller mean-squared error than that of
$t(X)$, if such exists.

18 Let $X_1, X_2, \ldots, X_n$ be a random sample from the density

where $\theta > 0$.

(a) Find a maximum-likelihood estimator of $\theta$.

(b) Is $Y_1 = \min[X_1, \ldots, X_n]$ a sufficient statistic?

19 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \tfrac{1}{2} e^{-|x - \theta|}$, $-\infty < \theta < \infty$.

(a) Discuss sufficiency for this density.

(b) Obtain a method-of-moments estimator of $\theta$.

(c) Find a maximum-likelihood estimator of $\theta$.

(d) Does $f(x; \theta)$ belong to an exponential class?

20 Find a maximum-likelihood estimator for $\alpha$ in the density
$f(x; \alpha) = (2/\alpha^2)(\alpha - x) I_{(0, \alpha)}(x)$ for samples of size 2. Is it a sufficient statistic?
Estimate $\alpha$ by the method of moments. What is the maximum-likelihood estimator
of the population mean?

21 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = (1/\theta) I_{[0, \theta]}(x)$, where $\theta > 0$.
Define $Y_n = \max[X_1, \ldots, X_n]$ and $Y_1 = \min[X_1, \ldots, X_n]$.

(a) Estimate $\theta$ by the method of moments. Call the estimator $T_1$. Find its
mean and mean-squared error.

(b) Find the maximum-likelihood estimator of $\theta$. Call the estimator $T_2$. Find
its mean and mean-squared error.

(c) Among all estimators of the form $aY_n$, where $a$ is a constant which may depend
on $n$, find that estimator which has uniformly smallest mean-squared error.
Call it $T_3$. Find its mean and mean-squared error.

(d) Find the UMVUE of $\theta$. Call it $T_4$. Obtain its mean and mean-squared
error.

(e) Let $T_5 = Y_1 + Y_n$. Find the mean and mean-squared error of $T_5$.

(f) What estimator of $\theta$ would you use and why?

(g) Find the maximum-likelihood estimator of the variance of the population.

22 Let $X_1, \ldots, X_n$ be a random sample from the Bernoulli distribution, say
$P[X = 1] = \theta = 1 - P[X = 0]$.

(a) Find the Cramér-Rao lower bound for the variance of unbiased estimators of
$\theta(1 - \theta)$.

(b) Find the UMVUE of $\theta(1 - \theta)$ if such exists.

23 Assuming $r$ known, find the maximum-likelihood estimator for $\lambda$ for a random
sample of size $n$ from a gamma distribution. Find a sufficient statistic if one
exists. Is your maximum-likelihood estimator unbiased? Is there an UMVUE
of $\lambda$?

24 Let $X_1, \ldots, X_n$ be a random sample from $\theta x^{\theta - 1} I_{(0, 1)}(x)$, where $\theta > 0$.

(a) Find the maximum-likelihood estimator of $\mu = \theta/(1 + \theta)$.

(b) Find a sufficient statistic, and check completeness. Is $\sum X_i$ a sufficient
statistic?

(c) Is there a function of $\theta$ for which there exists an unbiased estimator whose
variance coincides with the Cramér-Rao lower bound?

*(d) Find the UMVUE of each of the following: $\theta$, $1/\theta$, $\mu = \theta/(1 + \theta)$.

25 Let $X_1, \ldots, X_n$ be a random sample from the binomial distribution

$$f(x; p) = \binom{m}{x} p^x (1 - p)^{m - x}, \quad x = 0, 1, \ldots, m,$$

where $m$ is known and $0 \le p \le 1$.

(a) Estimate $p$ by the method of moments and the method of maximum likelihood.

(b) Is there an UMVUE of $p$? If so, find it.

*26 Let $X_1, \ldots, X_n$ be a random sample from the discrete density function


where $\theta = 1, 2, \ldots$. That is, $\Theta = \{\theta : \theta = 1, 2, \ldots\}$ = the set of positive integers.

(a) Find a method-of-moments estimator of $\theta$. Find its mean and mean-squared
error.

(b) Find a maximum-likelihood estimator of $\theta$. Find its mean and mean-squared
error.

(c) Find a complete sufficient statistic.

(d) Let $T = Y_n$, the largest order statistic. Show that the UMVUE of $\theta$ is

IT"** - (t- vy - (T- iyi 

27 Let A" be a single observation from the density fl)]** '(I — jc)' -1 /,, „{x). 

Is $X$ a sufficient statistic? Is $X$ complete?

28 An experimenter knows that the lifetime of a certain component has a
negative exponential distribution with mean $1/\theta$. On the basis of a random
sample of size $n$ of lifetimes he wants to estimate the median lifetime. Find both
the maximum likelihood and uniformly minimum-variance unbiased estimator of
the median.

29 Let $X_1, \ldots, X_n$ be a random sample from $N(\theta, 1)$.

(a) Find the Cramér-Rao lower bound for the variance of unbiased estimators of
$\theta$, $\theta^2$, and $P[X > 0]$.

(b) Is there an unbiased estimator of $\theta^2$ for $n = 1$? If so, find it.

(c) Is there an unbiased estimator of $P[X > 0]$? If so, find it.

(d) What is the maximum likelihood estimator of $P[X > 0]$?

(e) Is there an UMVUE of $\theta^2$? If so, find it.

(f) Is there an UMVUE of $P[X > 0]$? If so, find it.

30 For a random sample from the Poisson distribution, find an unbiased estimator of
$\tau(\lambda) = (1 + \lambda) e^{-\lambda}$. Find a maximum likelihood estimator of $\tau(\lambda)$. Find the
UMVUE of $\tau(\lambda)$.

31 Let $X_1, \ldots, X_n$ be a random sample from the density


/(*»*)= 0iW*) 


where $\theta > 0$.

(a) Find a maximum likelihood estimator of $\theta$.

(b) Is $Y_n = \max[X_1, \ldots, X_n]$ a sufficient statistic? Is $Y_n$ complete?

(c) Is there an UMVUE of $\theta$? If so, find it.

32 Let $X_1, \ldots, X_n$ be a random sample from the density

$$f(x; \theta) = \theta (1 + x)^{-(1 + \theta)} I_{(0, \infty)}(x) \quad \text{for } \theta > 0.$$

(a) Estimate $\theta$ by the method of moments assuming $\theta > 1$.

(b) Find the maximum-likelihood estimator of $1/\theta$.

(c) Find a complete and sufficient statistic if one exists.

(d) Find the Cramér-Rao lower bound for unbiased estimators of $1/\theta$.

(e) Find the UMVUE of $1/\theta$ if such exists.

(f) Find the UMVUE of $\theta$ if such exists.

33 Let $X_1, \ldots, X_n$ be a random sample from

$$f(x; \theta) = \frac{1}{2\theta} I_{(-\theta, \theta)}(x) \quad \text{for } \theta > 0.$$

(a) Find a maximum-likelihood estimator of $\theta$.

(b) Suppose $n = 1$, so that you have only one observation, say $X = X_1$. Clearly
$X$ is a sufficient statistic. Is $X$ a minimal sufficient statistic? Is $X$ complete?

34 Let $X_1, \ldots, X_n$ be a random sample from the negative exponential density

$$f(x; \theta) = \theta e^{-\theta x} I_{(0, \infty)}(x).$$

(a) Find the uniformly minimum-variance unbiased estimator of $\mathrm{var}[X_i]$ if such
exists.

(b) Find an unbiased estimator of $1/\theta$ based only on $Y_1^{(n)} = \min[X_1, \ldots, X_n]$.
Is your sequence of estimators mean-squared-error consistent?

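35 Let $X_1, \ldots, X_n$ be a random sample from the density

$$f(x; \theta) = \frac{\log \theta}{\theta - 1}\, \theta^x I_{(0, 1)}(x), \quad \theta > 1.$$

(a) Find a complete sufficient statistic if there is one.

(b) Find a function of $\theta$ for which there exists an unbiased estimator whose
variance coincides with the Cramér-Rao lower bound if such exists.

36 Show that

$$\mathcal{E}\left[\left(\frac{\partial}{\partial \theta} \log f(X; \theta)\right)^2\right] = -\mathcal{E}\left[\frac{\partial^2}{\partial \theta^2} \log f(X; \theta)\right].$$
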

37 Let $X_1, \ldots, X_n$ be a random sample from the density

$$f(x; \theta) = e^{-(x - \theta)} \exp(-e^{-(x - \theta)}),$$

where $-\infty < \theta < \infty$.

(a) Find a method-of-moments estimator of $\theta$.

(b) Find a maximum-likelihood estimator of $\theta$.

(c) Find a complete sufficient statistic.

(d) Find the Cramér-Rao lower bound for unbiased estimators of $\theta$.

(e) Is there a function of $\theta$ for which there exists an unbiased estimator, the
variance of which coincides with the Cramér-Rao lower bound? If so, find it.

*(f) Show that $\Gamma'(n)/\Gamma(n) - \log(\sum e^{-X_i})$ is the UMVUE of $\theta$.

38 Let $X_1, \ldots, X_n$ denote a random sample from

$$f(x; \theta) = f_\theta(x) = \theta f_1(x) + (1 - \theta) f_0(x),$$

where $0 \le \theta \le 1$ and $f_1(\cdot)$ and $f_0(\cdot)$ are known densities.

(a) Estimate $\theta$ by the method of moments.
(b) For $n = 2$, find a maximum likelihood estimator of $\theta$.
(c) Find the Cramér-Rao lower bound for the variance of unbiased estimators
of $\theta$.

39 Suppose that $c(\cdot)$ and $b(\cdot)$ are two nonnegative functions such that
$f(x; \theta) = c(\theta) b(x) I_{(0, \theta)}(x)$ is a probability density function for each $\theta > 0$.

(a) What is a maximum likelihood estimator of $\theta$?
(b) Is there a complete sufficient statistic? If so, find it.
(c) Is there an UMVUE of $\theta$? If so, find it.

40 Let $X_1, \ldots, X_n$ be a random sample from $N(\theta, \theta)$, $\theta > 0$.

(a) Find a complete sufficient statistic if such exists.

(b) Argue that $\bar{X}$ is not an UMVUE of $\theta$.
(c) Is $\theta$ either a location or scale parameter?

41 Let $X_1, \ldots, X_n$ be a random sample from $N(\theta, \theta^2)$, $-\infty < \theta < \infty$.

(a) Is there a unidimensional sufficient statistic?

(b) Find a two-dimensional sufficient statistic.

(c) Is $\bar{X}$ an UMVUE of $\theta$? (Hint: Find an unbiased estimator of $\theta$ based on $S^2$;
call it $T^*$. Find a constant $a$ to minimize $\mathrm{var}[a\bar{X} + (1 - a)T^*]$.)

(d) Is $\theta$ either a location or scale parameter?

42 Let $X_1, \ldots, X_n$ be a random sample of size $n$ from the density

$$f(x; \theta) = \frac{1}{\theta} I_{[\theta, 2\theta]}(x), \quad \theta > 0.$$

(a) Find a maximum likelihood estimator of $\theta$.

(b) We know that $Y_1$ and $Y_n$ are jointly sufficient. Are they jointly complete?

(c) Find the Pitman estimator for the scale parameter $\theta$.

(d) For $a$ and $b$ constant (they may depend on $n$), find an unbiased estimator of
$\theta$ of the form $aY_1 + bY_n$ satisfying $P[Y_n/2 \le aY_1 + bY_n \le Y_1] = 1$ if such
exists. Why is $P[Y_n/2 \le aY_1 + bY_n \le Y_1] = 1$ desirable?

43 Let $Z_1, \ldots, Z_n$ be a random sample from $N(0, \theta^2)$, $\theta > 0$. Define $X_i = |Z_i|$, and
consider estimation of $\theta$ and $\theta^2$ on the basis of the random sample $X_1, \ldots, X_n$.

(a) Find the UMVUE of $\theta^2$ if such exists.

(b) Find an estimator of $\theta^2$ that has uniformly smaller mean-squared error than
the estimator that you found in part (a).

(c) Find the UMVUE of $\theta$ if such exists.

(d) Find the Pitman estimator for the scale parameter $\theta$.

(e) Does the estimator that you found in part (d) have uniformly smaller mean-
squared error than the estimator that you found in part (c)?

44 Let $X_1, \ldots, X_n$ be a random sample from

$$f(x; \theta) = e^{-(x - \theta)} I_{[\theta, \infty)}(x) \quad \text{for } -\infty < \theta < \infty.$$

(a) Find a sufficient statistic.

(b) Find a maximum-likelihood estimator of $\theta$.

(c) Find a method-of-moments estimator of $\theta$.

(d) Is there a complete sufficient statistic? If so, find it.

(e) Find the UMVUE of $\theta$ if one exists.

(f) Find the Pitman estimator for the location parameter $\theta$.

(g) Using the prior density $g(\theta) = e^{-\theta} I_{(0, \infty)}(\theta)$, find the posterior Bayes estimator
of $\theta$.

45 Let $X_1, \ldots, X_n$ be a random sample from $f(x \mid \theta) = \theta x^{\theta - 1} I_{(0, 1)}(x)$, where $\theta > 0$.
Assume that the prior distribution of $\Theta$ is given by

$$g_\Theta(\theta) = [1/\Gamma(r)] \lambda^r \theta^{r - 1} e^{-\lambda \theta} I_{(0, \infty)}(\theta),$$

where $r$ and $\lambda$ are known.

(a) What is the posterior distribution of $\Theta$?

(b) Find the Bayes estimator of $\theta$ with respect to the given gamma prior
distribution using a squared-error loss function.

46 Let $X$ be a single observation from the density $f(x \mid \theta) = (2x/\theta^2) I_{(0, \theta)}(x)$, where
$\theta > 0$. Assume that $\Theta$ has a uniform prior distribution over the interval (0, 1).
For the loss function $\ell(t; \theta) = \theta^2 (t - \theta)^2$, find the Bayes estimator of $\theta$.

47 Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from the following discrete
density:

$$f(x; \theta) = \binom{2}{x} \theta^x (1 - \theta)^{2 - x} I_{\{0, 1, 2\}}(x),$$

where $\theta > 0$.

(a) Is there a unidimensional sufficient statistic? If so, is it complete?
(b) Find a maximum-likelihood estimator of $\theta^2 = P[X_i = 2]$. Is it unbiased?
(c) Find an unbiased estimator of $\theta$ whose variance coincides with the
corresponding Cramér-Rao lower bound if such exists. If such an estimator does
not exist, prove that it does not.
(d) Find a uniformly minimum-variance unbiased estimator of $\theta^2$ if such exists.
(e) Using the squared-error loss function, find a Bayes estimator of $\theta$ with respect
to the beta prior distribution.
(f) Using the squared-error loss function, find a minimax estimator of $\theta$.
(g) Find a mean-squared-error consistent estimator of $\theta^2$.

48 Let $X_1, \ldots, X_n$ be a random sample from a Poisson density

$$f(x; \theta) = \frac{e^{-\theta} \theta^x}{x!} I_{\{0, 1, \ldots\}}(x),$$

where $\theta > 0$. For a squared-error loss function find the Bayes estimator of $\theta$ for
a gamma prior distribution. Find the posterior distribution of $\Theta$. Find the
posterior Bayes estimator of $\tau(\theta) = P[X_i = 0]$.

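49 Let $X_1, \ldots, X_n$ be a random sample from $f(x \mid \theta) = (1/\theta) I_{(0, \theta)}(x)$, where $\theta > 0$.
For the loss function $(t - \theta)^2/\theta^2$ and a prior distribution proportional to
$\theta^{-\alpha} I_{(1, \infty)}(\theta)$, find the Bayes estimator of $\theta$.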

50 Let $X_1, \ldots, X_n$ be a random sample from the Bernoulli distribution. Using the
squared-error loss function, find that estimator of $\theta$ which has minimum area
under its risk function.

51 Let $X_1, \ldots, X_n$ be a random sample from the geometric density

$$f(x; \theta) = \theta (1 - \theta)^x I_{\{0, 1, \ldots\}}(x),$$

where $0 < \theta < 1$.

(a) Find a method-of-moments estimator of $\theta$.
(b) Find a maximum likelihood estimator of $\theta$.
(c) Find a maximum likelihood estimator of the mean.
(d) Find the Cramér-Rao lower bound for the variance of unbiased estimators of
$1 - \theta$.
(e) Is there a function of $\theta$ for which there exists an unbiased estimator the
variance of which coincides with the Cramér-Rao lower bound? If so,
find it.
(f) Find the UMVUE of $(1 - \theta)/\theta$ if such exists.
(g) Find the UMVUE of $\theta$ if such exists.
(h) Assume a uniform prior distribution and find the posterior distribution of $\Theta$.
For a squared-error loss function, find the Bayes estimator of $\theta$ with respect
to a uniform prior distribution.

52 Let $\theta$ be the true I.Q. of a certain student. To measure his I.Q., the student takes
a test, and it is known that his test scores are normally distributed with mean $\theta$ and
standard deviation 5.

(a) The student takes the I.Q. test and gets a score of 130. What is the maximum-
likelihood estimate of $\theta$?

(b) Suppose that it is known that I.Q.'s of students of a certain age are
distributed normally with mean 100 and variance 225; that is, $\Theta \sim N(100, 225)$.
Let $X$ denote a student's test score [$X$ is distributed $N(\theta, 25)$]. Find the
posterior distribution of $\Theta$ given $X = x$. What is the posterior Bayes estimate
of the student's I.Q. if $X = 130$?

*53 Let X,, , X, be a random sample from the density 

Ax , «. 6) =/(x, p) 



where $-\infty < \alpha < \infty$ and $\beta > 0$. Show that $Y_1$ and $\sum X_i$ are jointly sufficient.
It can be shown that $Y_1$ and $\sum (X_i - Y_1)$ are jointly complete and independent of
each other. Using such results, find the estimator of $(\alpha, \beta)$ that has an ellipsoid of
concentration that is contained in the ellipsoid of concentration of any other
unbiased estimator of $(\alpha, \beta)$. ($Y_1 = \min[X_1, \ldots, X_n]$.)

54 Let $X_1, \ldots, X_n$ be a random sample from the density

$$f(x; \alpha, \theta) = (1 - \theta) \theta^{x - \alpha} I_{\{\alpha, \alpha + 1, \ldots\}}(x),$$

where $-\infty < \alpha < \infty$ and $0 < \theta < 1$.

(a) Find a two-dimensional set of sufficient statistics.

*(b) Find the maximum-likelihood estimator of $(\alpha, \theta)$.

55 Let $X_1, \ldots, X_n$ be a random sample from the density

$$f(x; \theta) = \frac{2x}{\theta} I_{[0, \theta)}(x) + \frac{2(1 - x)}{1 - \theta} I_{[\theta, 1]}(x),$$

where $0 \le \theta \le 1$.

(a) Estimate $\theta$ by the method of moments.

(b) Find the maximum-likelihood estimator of $\theta$ for $n = 1$ and $n = 2$.

(c) For $n = 1$ find a complete sufficient statistic if such exists. Find a UMVUE
of $\theta$ for $n = 1$ if such exists.

*(d) Find the maximum-likelihood estimator of $\theta$.


VIII 

PARAMETRIC INTERVAL ESTIMATION 


1 INTRODUCTION AND SUMMARY 

Chapter VII dealt with the point estimation of a parameter, or more precisely,
point estimation of a value of a function of a parameter. Such point estimates
are quite useful, yet they leave something to be desired. In all those cases when
the point estimator under consideration had a probability density function, the
probability that the estimator actually equaled the value of the parameter being
estimated was 0. (The probability that a continuous random variable equals
any one value is 0.) Hence, it seems desirable that a point estimate should be
accompanied by some measure of the possible error of the estimate. For
instance, a point estimate might be accompanied by some interval about the point
estimate together with some measure of assurance that the true value of the
parameter lies within the interval. Instead of making the inference of estimating
the true value of the parameter to be a point, we might make the inference of
estimating that the true value of the parameter is contained in some interval.
We then speak of interval estimation, which is to be the subject of this chapter.
Like point estimation, the problem of interval estimation is twofold. First,
there is the problem of finding interval estimators, and, second, there is the
problem of determining good, or optimum, interval estimators. The considerations
of these two problems that will appear in this chapter will be incomplete. 
Further considerations will be presented at the end of the next chapter on testing 
hypotheses. The mathematics of interval estimation and hypotheses testing are 
closely related. Either concept could be used to introduce the other. In this 
book, we have decided to introduce interval estimation first, right after our 
presentation of point estimation, then introduce hypotheses testing, and finally 
point out the close mathematical relationship between the two. 

The introduction to interval estimation that appears in this chapter will 
not be as thorough as was our discussion of point estimation in the last chapter. 
One should not infer from this that interval estimation is less important since in 
practice the opposite is usually true. It is just easier to present the basic theory 
of point estimation. No concerted effort will be given to the problem of finding 
optimum interval estimators. The chapter will be divided into six main sections, 
the first being this introductory section. Section 2 will be devoted to confidence 
intervals, where the notion is introduced and defined. One method of finding 
confidence intervals will also be given as well as some idea as to what an optimum 
confidence interval might be. Section 3 will consider several examples of con- 
fidence intervals that are associated with sampling from the normal distribution. 
Such discussion will hinge on the results of Sec. 4 of Chap. VI. Several general 
methods of finding confidence intervals are given in Sec. 4; another method, 
which utilizes the theory of hypotheses testing, will be given at the end of Chap. 
IX. A brief discussion of large-sample confidence intervals appears in Sec. 5, 
and Sec. 6 presents another type of interval estimation, namely, Bayesian interval 
estimation. 


2 CONFIDENCE INTERVALS 


2.1 An Introduction to Confidence Intervals 

In practice, estimates are often given in the form of the estimate plus or minus a
certain amount. For instance, an electric charge may be estimated to be
$(4.770 \pm .005) \times 10^{-10}$ electrostatic unit with the idea that the first factor is very
unlikely to be outside the range 4.765 to 4.775. A cost accountant for a publishing
company in trying to allow for all factors which enter into the cost of producing
a certain book (actual production costs, proportion of plant overhead,
proportion of executive salaries, etc.) may estimate the cost to be 83 ± 4.5 cents per
volume with the implication that the correct cost very probably lies between
78.5 and 87.5 cents per volume. The Bureau of Labor Statistics may estimate
the number of unemployed in a certain area to be 2.4 ± .3 million at a given
time, feeling rather sure that the actual number is between 2.1 and 2.7 million.
What we are saying is that in practice one is quite accustomed to seeing estimates
in the form of intervals.

In order to give precision to these ideas, we shall consider a particular
example. Suppose that a random sample (1.2, 3.4, .6, 5.6) of four observations
is drawn from a normal population with an unknown mean $\mu$ and a known
standard deviation 3. The maximum likelihood estimate of $\mu$ is the mean of the
sample observations:

$$\bar{x} = 2.7.$$

We wish to determine upper and lower limits which are rather certain to contain
the true unknown parameter value between them.

In general, for samples of size 4 from the given distribution the quantity

$$Z = \frac{\bar{X} - \mu}{3/2}$$

will be normally distributed with mean 0 and unit variance. $\bar{X}$ is the sample
mean, and $3/2$ is $\sigma/\sqrt{n}$. Thus the quantity $Z$ has a density

$$f_Z(z) = \phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2},$$

which is independent of the true value of the unknown parameter, so we can
compute the probability that $Z$ will be between any two arbitrarily chosen
numbers. Thus, for example,

$$P[-1.96 < Z < 1.96] = \int_{-1.96}^{1.96} \phi(z)\, dz = .95. \qquad (1)$$

In this relation the inequality $-1.96 < Z$, or

$$-1.96 < \frac{\bar{X} - \mu}{3/2},$$

is equivalent to the inequality

$$\mu < \bar{X} + \tfrac{3}{2}(1.96) = \bar{X} + 2.94,$$

and the inequality

$$Z < 1.96$$
FIGURE 1



95 percent of the area under $\phi(z)$ lies between $a$ and $b$ will determine a 95 percent
confidence interval. Ordinarily one would want the confidence interval to be
as short as possible, and it is made so by making $a$ and $b$ as close together as
possible because the relation $P[a < Z < b] = .95$ gives rise to a confidence
interval of length $(\sigma/\sqrt{n})(b - a)$. The distance $b - a$ will be minimized for a
fixed area when $\phi(a) = \phi(b)$, as is evident on referring to Fig. 1. If the point $b$
is moved a short distance to the left, the point $a$ will need to be moved a lesser
distance to the left in order to keep the area the same; this operation decreases
the length of the interval and will continue to do so as long as $\phi(b) < \phi(a)$.
Since $\phi(z)$ is symmetrical about $z = 0$ in the present example, the minimum value
of $b - a$ for a fixed area occurs when $b = -a$. Thus for $\bar{x} = 2.7$, $(-.24, 5.64)$
gives the shortest 95 percent confidence interval, and $(-1.17, 6.57)$ gives the
shortest 99 percent confidence interval for $\mu$.
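
The arithmetic behind these two intervals can be checked with a short computation. The sketch below is illustrative only and is not part of the original text; the function name `z_interval` is an assumption introduced here, and it relies solely on Python's standard `statistics.NormalDist`. It uses exact normal quantiles (about 1.960 and 2.576), so the 99 percent limits differ slightly in the third decimal from the values obtained with the rounded table value 2.58.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_interval(sample, sigma, conf):
    """Symmetric (shortest) confidence interval for a normal mean, sigma known.

    Hypothetical helper for illustration; it inverts the pivotal quantity
    Z = (xbar - mu) / (sigma / sqrt(n)) using the cut points b = -a.
    """
    n = len(sample)
    xbar = mean(sample)
    # Upper cut point b with P[-b < Z < b] = conf for a standard normal Z.
    b = NormalDist().inv_cdf(0.5 + conf / 2.0)
    half_width = b * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


sample = [1.2, 3.4, 0.6, 5.6]          # the four observations of the example
lo95, hi95 = z_interval(sample, sigma=3.0, conf=0.95)
lo99, hi99 = z_interval(sample, sigma=3.0, conf=0.99)
print(round(lo95, 2), round(hi95, 2))  # 95 percent limits
print(round(lo99, 2), round(hi99, 2))  # 99 percent limits
```

Taking $b = -a$ is exactly the symmetric choice argued for above; any other pair of cut points with the same area would give a longer interval.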

In most problems it is not possible to construct confidence intervals which
are shortest for a given confidence coefficient. In these cases one may wish to
find a confidence interval which has the shortest expected length or is such that
the probability that the confidence interval covers a value $\mu^*$ is minimized, where

The method of finding a confidence interval that has been illustrated in the
example above is a general method. The method entails finding, if possible, a
function (the quantity $Z$ above) of the sample and the parameter to be estimated
which has a distribution independent of the parameter and any other parameters.
Then any probability statement of the form $P[a < Z < b] = \gamma$ for known $a$ and
$b$, where $Z$ is the function, will give rise to a probability statement about the
parameter that we hope can be rewritten to give a confidence interval. This
method, or technique, is fully described in Subsec. 2.3 below. This technique
is applicable in many important problems, but in others it is not, because in these
others it is either impossible to find functions of the desired form or it is
impossible to rewrite the derived probability statements. These latter problems can
be dealt with by a more general technique to be described in Sec. 4.

The idea of interval estimation can be extended to include simultaneous 
We note that one or the other, but not both, of the two statistics
$t_1(X_1, \ldots, X_n)$ and $t_2(X_1, \ldots, X_n)$ may be constant; that is, one of the two end
points of the random interval $(T_1, T_2)$ may be constant.


Definition 2 One-sided confidence interval Let $X_1, \ldots, X_n$ be a random
sample from the density $f(\cdot\,; \theta)$. Let $T_1 = t_1(X_1, \ldots, X_n)$ be a statistic
for which $P_\theta[T_1 < \tau(\theta)] \equiv \gamma$; then $T_1$ is called a one-sided lower confidence
limit for $\tau(\theta)$. Similarly, let $T_2 = t_2(X_1, \ldots, X_n)$ be a statistic for which
$P_\theta[\tau(\theta) < T_2] \equiv \gamma$; then $T_2$ is called a one-sided upper confidence limit for
$\tau(\theta)$. ($\gamma$ does not depend on $\theta$.) ////


EXAMPLE 1 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \phi_{\theta, 9}(x)$.
Set $T_1 = t_1(X_1, \ldots, X_n) = \bar{X} - 6/\sqrt{n}$ and $T_2 = t_2(X_1, \ldots, X_n) = \bar{X} + 6/\sqrt{n}$;
then $(T_1, T_2)$ constitutes a random interval and is a confidence
interval for $\tau(\theta) = \theta$ with confidence coefficient
$\gamma = P_\theta[\bar{X} - 6/\sqrt{n} < \theta < \bar{X} + 6/\sqrt{n}] = P_\theta[-2 < (\bar{X} - \theta)/(3/\sqrt{n}) < 2] = \Phi(2) - \Phi(-2) = .9772 - .0228 = .9544$.
Also, if a random sample of 25 observations has a sample mean
of, say, 17.5, then the interval $(17.5 - 6/\sqrt{25},\ 17.5 + 6/\sqrt{25})$ is also
called a 95.44 percent confidence interval for $\theta$. ////
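
As a numerical check on Example 1, the following sketch (an illustration added here, not part of the original text; the variable names are assumptions) evaluates $\Phi(2) - \Phi(-2)$ and the realized interval for $n = 25$ and $\bar{x} = 17.5$ with Python's standard library. The exact value is about .9545; the .9544 above comes from differencing four-place table entries.

```python
from math import sqrt
from statistics import NormalDist

phi = NormalDist()                      # standard normal distribution

# Confidence coefficient of the interval X-bar -/+ 6/sqrt(n) when sigma = 3.
gamma = phi.cdf(2.0) - phi.cdf(-2.0)

# Realized interval for a sample of 25 observations with sample mean 17.5.
n, xbar, sigma = 25, 17.5, 3.0
half_width = 2.0 * sigma / sqrt(n)      # equals 6 / sqrt(25) = 1.2
interval = (xbar - half_width, xbar + half_width)

print(round(gamma, 4))
print(interval)
```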


Remark If a confidence interval for $\theta$ has been determined, then, in
essence, a whole family of confidence intervals has been determined.
More specifically, for a given $100\gamma$ percent confidence interval estimator of
$\theta$, a $100\gamma$ percent confidence interval estimator of $\tau(\theta)$ can be obtained,
where $\tau(\cdot)$ is any strictly monotone function. For example, if $\tau(\cdot)$
is a monotone, increasing function and $(T_1 = t_1(X_1, \ldots, X_n),\ T_2 = t_2(X_1, \ldots, X_n))$
is a $100\gamma$ percent confidence interval for $\theta$, then
$(\tau(T_1), \tau(T_2))$ is a $100\gamma$ percent confidence interval for $\tau(\theta)$ since

$$P_\theta[\tau(T_1) < \tau(\theta) < \tau(T_2)] = P_\theta[T_1 < \theta < T_2] = \gamma. \quad ////$$
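
The content of the Remark can be seen in a small simulation: the event that $(T_1, T_2)$ covers $\theta$ and the event that $(\tau(T_1), \tau(T_2))$ covers $\tau(\theta)$ are the same event, so the two coverage counts agree sample by sample. The sketch below is an illustration added here under stated assumptions: it uses the interval of Example 1, takes $\tau(\theta) = e^\theta$ as one arbitrary increasing function, and fixes $\theta = 1$, $n = 25$, and the random seed purely for reproducibility; the function names are hypothetical.

```python
import math
import random


def example1_interval(sample):
    """Interval X-bar -/+ 6/sqrt(n) of Example 1 (normal data with variance 9)."""
    n = len(sample)
    xbar = sum(sample) / n
    half_width = 6.0 / math.sqrt(n)
    return xbar - half_width, xbar + half_width


def coverage_counts(theta, n, reps, tau, seed=0):
    """Count how often the interval covers theta and how often the
    transformed interval covers tau(theta), over the same simulated samples."""
    rng = random.Random(seed)
    hits_theta = 0
    hits_tau = 0
    for _ in range(reps):
        sample = [rng.gauss(theta, 3.0) for _ in range(n)]
        t1, t2 = example1_interval(sample)
        hits_theta += t1 < theta < t2
        hits_tau += tau(t1) < tau(theta) < tau(t2)
    return hits_theta, hits_tau


reps = 2000
hits_theta, hits_tau = coverage_counts(theta=1.0, n=25, reps=reps, tau=math.exp)
print(hits_theta, hits_tau, hits_theta / reps)
```

The observed coverage fraction should be close to the confidence coefficient .9544 of Example 1, and it is identical for $\theta$ and for $e^\theta$.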


As was the case in point estimation, our problem is twofold. First, we need
methods of finding a confidence interval, and, second, we need criteria for
comparing competing confidence intervals or for assessing the goodness of a
confidence interval. In the next subsection, we will describe one method of
finding confidence intervals and call it the pivotal-quantity method.
FIGURE 3



2.3 Pivotal Quantity 

As before, we assume a random sample $X_1, \ldots, X_n$ from some density $f(\cdot\,; \theta)$
parameterized by $\theta$. Our object is to find a confidence-interval estimate of $\tau(\theta)$,
a real-valued function of $\theta$. $\theta$ itself may be vector-valued.

Definition 3 Pivotal quantity Let $X_1, \ldots, X_n$ be a random sample from
the density $f(\cdot\,; \theta)$. Let $Q = q(X_1, \ldots, X_n; \theta)$; that is, let $Q$ be a function
of $X_1, \ldots, X_n$ and $\theta$. If $Q$ has a distribution that does not depend on $\theta$,
then $Q$ is defined to be a pivotal quantity. ////


EXAMPLE 2 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \phi_{\theta, 9}(x)$.
$\bar{X} - \theta$ is a pivotal quantity since $\bar{X} - \theta$ is normally distributed with mean 0
and variance $9/n$. Also $(\bar{X} - \theta)/(3/\sqrt{n})$ has a standard normal distribution
and, hence, is a pivotal quantity. On the other hand, $\bar{X}/\theta$ is not a pivotal
quantity since $\bar{X}/\theta$ is normally distributed with mean unity and variance
$9/(\theta^2 n)$, which depends on $\theta$. ////
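
The distinction drawn in Example 2 can be illustrated by simulation. In the sketch below (added for illustration and not part of the original text; the function name, the choice $n = 9$, the two values of $\theta$, and the seed are all assumptions), the simulated variance of $\bar{X} - \theta$ stays near $9/n = 1$ whatever $\theta$ is, while the variance of $\bar{X}/\theta$ moves with $\theta$ as $9/(\theta^2 n)$.

```python
import random
from statistics import pvariance


def simulate(theta, n=9, reps=4000, seed=1):
    """Simulated variances of X-bar - theta and X-bar / theta
    for samples of size n from a normal density with mean theta, variance 9."""
    rng = random.Random(seed)
    diffs = []
    ratios = []
    for _ in range(reps):
        xbar = sum(rng.gauss(theta, 3.0) for _ in range(n)) / n
        diffs.append(xbar - theta)      # candidate pivotal quantity
        ratios.append(xbar / theta)     # not pivotal: distribution involves theta
    return pvariance(diffs), pvariance(ratios)


var_diff_1, var_ratio_1 = simulate(theta=1.0)
var_diff_10, var_ratio_10 = simulate(theta=10.0)
print(var_diff_1, var_diff_10)     # both near 9/n = 1
print(var_ratio_1, var_ratio_10)   # near 1 and near 0.01, respectively
```

A simulation of this kind can only suggest, not prove, that a distribution is free of $\theta$; the proof in Example 2 rests on the known normal distribution of $\bar{X}$.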

Our hope is to utilize a pivotal quantity to obtain a confidence interval. 

Pivotal-quantity method If $Q = q(X_1, \ldots, X_n; \theta)$ is a pivotal quantity and
has a probability density function, then for any fixed $0 < \gamma < 1$ there will exist $q_1$
and $q_2$ depending on $\gamma$ such that $P[q_1 < Q < q_2] = \gamma$. Now, if for each possible
sample value $(x_1, \ldots, x_n)$, $q_1 < q(x_1, \ldots, x_n; \theta) < q_2$ if and only if
$t_1(x_1, \ldots, x_n) < \tau(\theta) < t_2(x_1, \ldots, x_n)$ for functions $t_1$ and $t_2$ (not depending on $\theta$),
then $(T_1, T_2)$ is a $100\gamma$ percent confidence interval for $\tau(\theta)$, where
$T_i = t_i(X_1, \ldots, X_n)$, $i = 1, 2$.

Before illustrating the pivotal-quantity method with a simple example we
make several comments. First, $q_1$ and $q_2$ are independent of $\theta$ since the
distribution of $Q$ is. Second, for any fixed $\gamma$ there are many possible pairs of numbers
$q_1$ and $q_2$ that can be selected so that $P[q_1 < Q < q_2] = \gamma$. See Fig. 3. Different
pairs of $q_1$ and $q_2$ will produce different $t_1$ and $t_2$. We should want to select
that pair of $q_1$ and $q_2$ that will make $t_1$ and $t_2$ close together in some sense. For
instance, if $t_2(X_1, \ldots, X_n) - t_1(X_1, \ldots, X_n)$, which is the length of the confidence
interval, is not random, then we might select that pair of $q_1$ and $q_2$ that makes the
length of the interval smallest; or if the length of the confidence interval is
random, then we might select that pair of $q_1$ and $q_2$ that makes the average
length of the interval smallest.

As a third and final comment, note that the essential feature of the pivotal-
quantity method is that the inequality $\{q_1 < q(x_1, \ldots, x_n; \theta) < q_2\}$ can be
rewritten, or inverted, or "pivoted," as $\{t_1(x_1, \ldots, x_n) < \tau(\theta) < t_2(x_1, \ldots, x_n)\}$ for
any possible sample value $x_1, \ldots, x_n$. [This last comment indicates that
"pivotal quantity" may be a misnomer since, according to our definition,
$Q = q(X_1, \ldots, X_n; \theta)$ may be a pivotal quantity, yet it may be impossible to "pivot"
it.]


EXAMPLE 3 Let $X_1, \ldots, X_n$ be a random sample from $\phi_{\theta, 1}(x)$. Consider
estimating $\tau(\theta) = \theta$. $Q = q(X_1, \ldots, X_n; \theta) = (\bar{X} - \theta)/\sqrt{1/n}$ has a
standard normal distribution and, hence, is a pivotal quantity. $f_Q(q) = \phi(q)$.
For given $\gamma$ there exist $q_1$ and $q_2$ such that $P[q_1 < Q < q_2] = \gamma$ (in
fact, there exist many such $q_1$ and $q_2$). See Fig. 4.

Now $\{q_1 < (\bar{x} - \theta)/\sqrt{1/n} < q_2\}$ if and only if
$\{\bar{x} - q_2\sqrt{1/n} < \theta < \bar{x} - q_1\sqrt{1/n}\}$, so $(\bar{X} - q_2\sqrt{1/n},\ \bar{X} - q_1\sqrt{1/n})$ is a
$100\gamma$ percent confidence interval for $\theta$. The length of the confidence interval
is given by $(\bar{X} - q_1\sqrt{1/n}) - (\bar{X} - q_2\sqrt{1/n}) = (q_2 - q_1)\sqrt{1/n}$, so the length
will be made smallest by selecting $q_1$ and $q_2$ so that $q_2 - q_1$ is a minimum under the
restriction that $\gamma = P[q_1 < Q < q_2] = \Phi(q_2) - \Phi(q_1)$, and $q_2 - q_1$ will be a
minimum if $q_1 = -q_2$, as can be seen from Fig. 4. ////
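
The claim that the symmetric choice $q_1 = -q_2$ gives the shortest interval can be checked numerically by holding $\gamma$ fixed and varying how the remaining probability $1 - \gamma$ is split between the two tails. The sketch below is an illustration added here, not part of the original text; the function name, the grid of lower-tail probabilities, and the sample size $n = 16$ are assumptions chosen for the demonstration.

```python
from math import sqrt
from statistics import NormalDist


def interval_length(lower_tail, gamma, n):
    """Length (q2 - q1) * sqrt(1/n) of the Example 3 interval when
    P[Q < q1] = lower_tail and P[q1 < Q < q2] = gamma, Q standard normal."""
    phi = NormalDist()
    q1 = phi.inv_cdf(lower_tail)
    q2 = phi.inv_cdf(lower_tail + gamma)
    return (q2 - q1) * sqrt(1.0 / n)


gamma, n = 0.95, 16
tails = [0.005, 0.015, 0.025, 0.035, 0.045]   # candidate values of P[Q < q1]
lengths = {a: interval_length(a, gamma, n) for a in tails}
best_tail = min(lengths, key=lengths.get)

for a in tails:
    print(a, round(lengths[a], 4))
print(best_tail)
```

Among these candidates the equal-tails split, lower tail $(1 - \gamma)/2 = .025$, gives the smallest length, which agrees with the argument from Fig. 4; unequal splits lengthen the interval by the same amount whichever tail is favored, as the symmetry of $\phi$ requires.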


The steps in the pivotal-quantity method of finding a confidence interval
are two: First, find a pivotal quantity, and, second, invert it. We will comment
further on techniques for finding pivotal quantities in Sec. 4. The method is
thoroughly exploited in the next section.

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3 SAMPLING FROM THE NORMAL DISTRIBUTION 

Let $X_1, \ldots, X_n$ be a random sample from the normal distribution with mean $\mu$
and variance $\sigma^2$. The first three subsections of this section are generated by
the cases (i) confidence interval for $\mu$ only, (ii) confidence interval for $\sigma^2$ only,
and (iii) simultaneous confidence interval for $\mu$ and $\sigma$. The fourth subsection
considers a confidence interval for the difference between two means.


3.1 Confidence Interval for the Mean 


There are really two cases to consider depending on whether or not $\sigma^2$ is known.
We leave the case $\sigma^2$ known as an exercise. (The technique is given in Example 3.)
We want a confidence interval for $\mu$ when $\sigma^2$ is unknown. In our general
discussion in Sec. 2 above our parameter was denoted by $\theta$. Here $\theta = (\mu, \sigma)$, and
$\tau(\theta) = \mu$. We need a pivotal quantity. $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ has a standard normal
distribution; so it is a pivotal quantity, but $\{q_1 < (\bar{x} - \mu)/(\sigma/\sqrt{n}) < q_2\}$ cannot be
inverted to give $\{t_1(x_1, \ldots, x_n) < \mu < t_2(x_1, \ldots, x_n)\}$ for any statistics $t_1$ and $t_2$.
The problem with $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ seems to be the presence of $\sigma$. We look for a
pivotal quantity involving only $\mu$. We know that

$$\frac{(\bar{X} - \mu)/(\sigma/\sqrt{n})}{\sqrt{\sum (X_i - \bar{X})^2/(n-1)\sigma^2}} = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

has a $t$ distribution with $n - 1$ degrees of freedom. [Recall that
$S^2 = \sum (X_i - \bar{X})^2/(n - 1)$.] So $(\bar{X} - \mu)/(S/\sqrt{n})$ has a density that is independent of
$\mu$ and $\sigma^2$; hence it is a pivotal quantity. Now one has $\{q_1 < (\bar{x} - \mu)/(s/\sqrt{n}) < q_2\}$
if and only if $\{\bar{x} - q_2(s/\sqrt{n}) < \mu < \bar{x} - q_1(s/\sqrt{n})\}$, where $q_1$ and $q_2$ are such that
$P[q_1 < (\bar{X} - \mu)/(S/\sqrt{n}) < q_2] = \gamma$; therefore, $(\bar{X} - q_2(S/\sqrt{n}),\ \bar{X} - q_1(S/\sqrt{n}))$ is
a $100\gamma$ percent confidence interval for $\mu$. The length of this confidence interval
is $(q_2 - q_1)(S/\sqrt{n})$, which is random. For any given sample the length will be
minimized if $q_1$ and $q_2$ are selected so that $q_2 - q_1$ is a minimum. A little
reflection will convince one that $q_1$ and $q_2$ should be symmetrically selected about
0, or the following argument can be advanced. We seek to minimize

$$L = \frac{S}{\sqrt{n}}(q_2 - q_1)$$

subject to

$$\int_{q_1}^{q_2} f_T(t)\,dt = \gamma, \tag{3}$$

where $f_T(t)$ is the density of the $t$ distribution with $n - 1$ degrees of freedom.
Equation (3) gives $q_2$ as a function of $q_1$, and differentiating Eq. (3) with respect
to $q_1$ yields

$$f_T(q_2)\frac{dq_2}{dq_1} - f_T(q_1) = 0.$$

To minimize $L$, we set $dL/dq_1 = 0$; that is,

$$\frac{dL}{dq_1} = \frac{S}{\sqrt{n}}\left(\frac{dq_2}{dq_1} - 1\right) = \frac{S}{\sqrt{n}}\left(\frac{f_T(q_1)}{f_T(q_2)} - 1\right) = 0$$

if and only if $f_T(q_1) = f_T(q_2)$, which implies that $q_1 = q_2$ [in which case
$\int_{q_1}^{q_2} f_T(t)\,dt \neq \gamma$] or $q_1 = -q_2$. $q_1 = -q_2$ is the desired solution, and such $q_1$
and $q_2$ can be readily obtained from a table of the $t$ distribution.
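
As a rough illustration of the resulting interval $(\bar{X} - q_2 S/\sqrt{n},\ \bar{X} + q_2 S/\sqrt{n})$, the sketch below computes the $t$ quantile numerically instead of reading it from a table. It is a minimal sketch, assuming only the Python standard library: the helper names (`t_density`, `t_cdf`, `t_quantile`, `t_interval`) and the data are hypothetical, and the quantile is found by Simpson's rule plus bisection, which is one convenient choice and not a method taken from the text.

```python
import math
from statistics import mean, stdev

def t_density(t, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def t_cdf(t, df, steps=2000):
    """P[T <= t], using Simpson's rule on [0, |t|] and symmetry about 0."""
    a = abs(t)
    if a == 0.0:
        return 0.5
    h = a / steps
    total = t_density(0.0, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area

def t_quantile(p, df):
    """Quantile for 0.5 < p < 1, located by bisection on t_cdf."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:      # expand the bracket until it contains the quantile
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_interval(sample, gamma):
    """Symmetric (q1 = -q2) confidence interval for the mean, variance unknown."""
    n = len(sample)
    q2 = t_quantile((1 + gamma) / 2, n - 1)
    half = q2 * stdev(sample) / math.sqrt(n)   # stdev uses the n - 1 divisor, as S does
    return mean(sample) - half, mean(sample) + half

data = [4.9, 5.1, 5.3, 4.7, 5.0]   # hypothetical observations
low, high = t_interval(data, 0.95)
```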

3.2 Confidence Interval for the Variance

Again there are two cases depending on whether or not $\mu$ is assumed known, and
again we leave the case $\mu$ known as an exercise. We want a confidence interval
for $\sigma^2$ when $\mu$ is unknown. We need a pivotal quantity that can be inverted.
We know that

$$Q = \frac{\sum (X_i - \bar{X})^2}{\sigma^2} = \frac{(n - 1)S^2}{\sigma^2}$$

has a chi-square distribution with $n - 1$ degrees of freedom; hence $Q$ is a pivotal
quantity. Also, one has

$$\left\{q_1 < \frac{(n - 1)s^2}{\sigma^2} < q_2\right\}$$

if and only if

$$\left\{\frac{(n - 1)s^2}{q_2} < \sigma^2 < \frac{(n - 1)s^2}{q_1}\right\};$$

so

$$\left(\frac{(n - 1)S^2}{q_2},\ \frac{(n - 1)S^2}{q_1}\right)$$

is a $100\gamma$ percent confidence interval for $\sigma^2$, where $q_1$ and $q_2$ are given by
$P[q_1 < Q < q_2] = \gamma$. See Fig. 5.

FIGURE 5
Chi-square density with $n - 1$ degrees of freedom.

$q_1$ and $q_2$ are often selected so that $P[Q < q_1] = P[Q > q_2] = (1 - \gamma)/2$.
Such a confidence interval is sometimes referred to as the equal-tails confidence
interval for $\sigma^2$. $q_1$ and $q_2$ can be obtained from a table of the chi-square
distribution. Again, we might be interested in selecting $q_1$ and $q_2$ so as to
minimize the length, say $L$, of the confidence interval.

$$L = (n - 1)S^2\left(\frac{1}{q_1} - \frac{1}{q_2}\right).$$

Let $f_Q(q)$ be a chi-square density with $n - 1$ degrees of freedom; then
differentiating

$$\int_{q_1}^{q_2} f_Q(q)\,dq = \gamma$$

with respect to $q_1$ yields

$$f_Q(q_2)\frac{dq_2}{dq_1} - f_Q(q_1) = 0,$$

and so

$$\frac{dL}{dq_1} = (n - 1)S^2\left(-\frac{1}{q_1^2} + \frac{1}{q_2^2}\frac{dq_2}{dq_1}\right) = (n - 1)S^2\left(-\frac{1}{q_1^2} + \frac{1}{q_2^2}\frac{f_Q(q_1)}{f_Q(q_2)}\right) = 0,$$

which implies that $q_1^2 f_Q(q_1) = q_2^2 f_Q(q_2)$. The length of the confidence interval
will be minimized if $q_1$ and $q_2$ are selected so that

$$q_1^2 f_Q(q_1) = q_2^2 f_Q(q_2)$$

subject to

$$\int_{q_1}^{q_2} f_Q(q)\,dq = \gamma.$$

A solution for $q_1$ and $q_2$ can be obtained by trial and error or numerical
integration.

We might note that for any $q_1$ and $q_2$ satisfying

$$\int_{q_1}^{q_2} f_Q(q)\,dq = \gamma,$$

$$\left(\sqrt{\frac{(n - 1)S^2}{q_2}},\ \sqrt{\frac{(n - 1)S^2}{q_1}}\right)$$

is a $100\gamma$ percent confidence interval for $\sigma$.
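
The "trial and error or numerical integration" step can be sketched as follows. This is a hedged illustration, not the book's procedure: the helper names, the series used for the chi-square distribution function, and the ternary search over the lower-tail probability are all assumptions made for the example, and the search presumes that the interval length is unimodal in that tail probability.

```python
import math

def chi2_cdf(x, df):
    """P[Q <= x] for chi-square with df degrees of freedom (incomplete-gamma series)."""
    if x <= 0:
        return 0.0
    a, z = df / 2, x / 2
    term = total = 1.0 / a
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_quantile(p, df):
    """Quantile for 0 < p < 1, located by bisection on chi2_cdf."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:
        hi *= 2
    for _ in range(80):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def variance_interval(s2, n, gamma, lower_tail):
    """((n-1)s2/q2, (n-1)s2/q1) with P[Q < q1] = lower_tail and P[q1 < Q < q2] = gamma."""
    q1 = chi2_quantile(lower_tail, n - 1)
    q2 = chi2_quantile(lower_tail + gamma, n - 1)
    return (n - 1) * s2 / q2, (n - 1) * s2 / q1

def shortest_lower_tail(n, gamma, iterations=60):
    """Lower-tail probability minimizing 1/q1 - 1/q2 (ternary search, unimodality assumed)."""
    def rel_length(a):
        return 1 / chi2_quantile(a, n - 1) - 1 / chi2_quantile(a + gamma, n - 1)
    lo, hi = 1e-6, 1 - gamma - 1e-6
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if rel_length(m1) < rel_length(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

s2, n, gamma = 2.5, 5, 0.95                      # hypothetical sample variance and size
equal_tails = variance_interval(s2, n, gamma, (1 - gamma) / 2)
best_tail = shortest_lower_tail(n, gamma)
shortest = variance_interval(s2, n, gamma, best_tail)
```

For small $n$ the shortest interval puts nearly all of the excluded probability in the lower tail, so it can differ noticeably from the equal-tails interval.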


3.3 Simultaneous Confidence Region for the Mean and Variance

In constructing a region for the joint estimation of the mean $\mu$ and variance $\sigma^2$
of a normal distribution, one might at first be inclined to use Subsecs. 3.1 and 3.2
above. That is, for example, one might construct a confidence region as in
Fig. 6 by using the two relations

$$P\left[\bar{X} - t_{.975}\frac{S}{\sqrt{n}} < \mu < \bar{X} + t_{.975}\frac{S}{\sqrt{n}}\right] = .95 \tag{4}$$

and

$$P\left[\frac{(n - 1)S^2}{\chi^2_{.975}} < \sigma^2 < \frac{(n - 1)S^2}{\chi^2_{.025}}\right] = .95, \tag{5}$$

where $t_{.975}$ is the .975th quantile point of the $t$ distribution with $n - 1$ degrees
of freedom and $\chi^2_{.025}$ and $\chi^2_{.975}$ are the .025th quantile point and .975th quantile
point, respectively, of the chi-square distribution with $n - 1$ degrees of freedom.
The region displayed in Fig. 6 does indeed give a confidence region for $(\mu, \sigma^2)$,
but we do not know what its corresponding confidence coefficient is. [It is not
$.95^2$, since the two events given in Eqs. (4) and (5) are not independent.]

A confidence region, whose corresponding confidence coefficient can be
readily evaluated, may be set up, however, using the independence of $\bar{X}$ and $S^2$.
Since

$$Q_1 = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \quad\text{and}\quad Q_2 = \frac{(n - 1)S^2}{\sigma^2}$$

are each pivotal quantities, we may find numbers $q_1$, $q_2'$, and $q_2''$ such that

$$P\left[-q_1 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < q_1\right] = \gamma_1 \tag{6}$$

and

$$P\left[q_2' < \frac{(n - 1)S^2}{\sigma^2} < q_2''\right] = \gamma_2. \tag{7}$$

Also, since $Q_1$ and $Q_2$ are independent, we have the joint probability

$$P\left[-q_1 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < q_1;\ q_2' < \frac{(n - 1)S^2}{\sigma^2} < q_2''\right] = \gamma_1\gamma_2. \tag{8}$$

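The four inequalities in Eq. (8) determine a region in the parameter space, which
is easily found by plotting its boundaries. One merely replaces the inequality
signs by equality signs and plots each of the four resulting equations as functions
of $\mu$ and $\sigma^2$. A region such as the shaded area in Fig. 7 will result.

We might note that a confidence region for $(\mu, \sigma)$ could be obtained in the same
way by plotting against $\sigma$ instead of $\sigma^2$; the parabola would then become a pair
of straight lines given by $\mu = \bar{x} \pm q_1\sigma/\sqrt{n}$. The region that we have constructed
does not have minimum area for fixed $\gamma_1$ and $\gamma_2$. Its advantage is that it is easily
constructed, and it will differ but little from the region of minimum area unless the
sample size is small. The region of minimum area is roughly elliptic in shape
and difficult to construct.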
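
To make the region of Eq. (8) concrete, the sketch below tests whether a candidate point $(\mu, \sigma^2)$ satisfies the four inequalities. It is an illustration only: the function name `in_joint_region` and the data summary are hypothetical, $\gamma_1 = \gamma_2 = .90$ is an arbitrary choice giving joint coefficient .81, and the two chi-square quantiles for 4 degrees of freedom are ordinary table values typed in by hand.

```python
import math
from statistics import NormalDist

def in_joint_region(mu, sigma2, xbar, s2, n, q1, q2_lower, q2_upper):
    """True when (mu, sigma2) satisfies the four inequalities of Eq. (8)."""
    z = (xbar - mu) / math.sqrt(sigma2 / n)   # observed value of Q1 at this (mu, sigma2)
    q = (n - 1) * s2 / sigma2                 # observed value of Q2 at this sigma2
    return -q1 < z < q1 and q2_lower < q < q2_upper

gamma1 = gamma2 = 0.90
q1 = NormalDist().inv_cdf((1 + gamma1) / 2)   # Q1 is standard normal
# .05 and .95 quantile points of chi-square with n - 1 = 4 degrees of freedom (table values).
q2_lower, q2_upper = 0.7107, 9.4877

xbar, s2, n = 10.0, 4.0, 5                    # hypothetical sample summary
joint_coefficient = gamma1 * gamma2           # product rule from the independence of Q1 and Q2

inside = in_joint_region(10.0, 4.0, xbar, s2, n, q1, q2_lower, q2_upper)
outside = in_joint_region(20.0, 4.0, xbar, s2, n, q1, q2_lower, q2_upper)
```

Evaluating `in_joint_region` over a grid of $(\mu, \sigma^2)$ values is one simple way to shade the region.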
3.4 Confidence Interval for Difference in Means

Let $X_1, \ldots, X_m$ be a random sample of size $m$ from a normal distribution with
mean $\mu_1$ and variance $\sigma^2$, and let $Y_1, \ldots, Y_n$ be a random sample of size $n$ from
a normal distribution with mean $\mu_2$ and variance $\sigma^2$. Assume that the two
samples are independent of each other. We want a confidence interval for
$\mu_2 - \mu_1$. $\bar{Y} - \bar{X}$ is normally distributed with mean $\mu_2 - \mu_1$ and variance
$\sigma^2/n + \sigma^2/m$. $\sum (X_i - \bar{X})^2/\sigma^2$ is chi-square-distributed with $m - 1$ degrees of
freedom, and $\sum (Y_j - \bar{Y})^2/\sigma^2$ is chi-square-distributed with $n - 1$ degrees of
freedom; hence

$$\frac{\sum (X_i - \bar{X})^2}{\sigma^2} + \frac{\sum (Y_j - \bar{Y})^2}{\sigma^2}$$

is chi-square-distributed with $m + n - 2$ degrees of freedom. Finally,

$$Q = \frac{[(\bar{Y} - \bar{X}) - (\mu_2 - \mu_1)]/\sqrt{\sigma^2/m + \sigma^2/n}}{\sqrt{[\sum (X_i - \bar{X})^2 + \sum (Y_j - \bar{Y})^2]/\sigma^2(m + n - 2)}}$$

$$= \frac{(\bar{Y} - \bar{X}) - (\mu_2 - \mu_1)}{\sqrt{(1/m + 1/n)[\sum (X_i - \bar{X})^2 + \sum (Y_j - \bar{Y})^2]/(m + n - 2)}}$$

$$= \frac{(\bar{Y} - \bar{X}) - (\mu_2 - \mu_1)}{\sqrt{(1/m + 1/n)S_p^2}}$$

has a $t$ distribution with $m + n - 2$ degrees of freedom. Thus it follows that
$\gamma = P[-t_{(1+\gamma)/2} < Q < t_{(1+\gamma)/2}]$, where $t_{(1+\gamma)/2}$ is the $[(1 + \gamma)/2]$th quantile point
of the $t$ distribution with $m + n - 2$ degrees of freedom. $S_p^2$ is an unbiased
estimator of the common variance $\sigma^2$. (The subscript $p$ can be thought of as an
abbreviation for "pooled"; $S_p^2$ is a pooled estimator of $\sigma^2$, the two samples
being pooled together.) Now

$$-t_{(1+\gamma)/2} < Q < t_{(1+\gamma)/2}$$

if and only if

$$(\bar{Y} - \bar{X}) - t_{(1+\gamma)/2}\sqrt{\left(\frac{1}{m} + \frac{1}{n}\right)S_p^2} < \mu_2 - \mu_1 < (\bar{Y} - \bar{X}) + t_{(1+\gamma)/2}\sqrt{\left(\frac{1}{m} + \frac{1}{n}\right)S_p^2};$$

hence

$$\left((\bar{Y} - \bar{X}) - t_{(1+\gamma)/2}\sqrt{\left(\frac{1}{m} + \frac{1}{n}\right)S_p^2},\ (\bar{Y} - \bar{X}) + t_{(1+\gamma)/2}\sqrt{\left(\frac{1}{m} + \frac{1}{n}\right)S_p^2}\right) \tag{9}$$

is a $100\gamma$ percent confidence interval for $\mu_2 - \mu_1$.

We assumed above that we had two independent random samples. Now
assume that $(X_1, Y_1), \ldots, (X_n, Y_n)$ is a random sample from the bivariate normal
distribution with parameters given by $\mu_1 = \mathscr{E}[X]$, $\mu_2 = \mathscr{E}[Y]$, $\sigma_1^2 = \text{var}[X]$,
$\sigma_2^2 = \text{var}[Y]$, and $\rho = \text{cov}[X, Y]/\sigma_1\sigma_2$. The object is to find a confidence-
interval estimate of $\mu_2 - \mu_1$. Let $D_i = Y_i - X_i$, $i = 1, \ldots, n$; then $D_1, \ldots, D_n$
are independent and identically distributed random variables with common
normal distribution having mean $\mu_D = \mu_2 - \mu_1$ and variance
$\sigma_D^2 = \sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2$. Pretending that $D_1, \ldots, D_n$ is our random sample and
proceeding as in Subsec. 3.1, we obtain the following $100\gamma$ percent confidence
interval for $\mu_2 - \mu_1$:

$$\left(\bar{D} - t_{(1+\gamma)/2}\sqrt{\frac{\sum (D_i - \bar{D})^2}{n(n - 1)}},\ \bar{D} + t_{(1+\gamma)/2}\sqrt{\frac{\sum (D_i - \bar{D})^2}{n(n - 1)}}\right), \tag{10}$$

where $t_{(1+\gamma)/2}$ is the $[(1 + \gamma)/2]$th quantile point of the $t$ distribution with $n - 1$
degrees of freedom. The above-obtained confidence interval for $\mu_2 - \mu_1$ is
often referred to as the confidence interval for the difference in means for paired
observations. The $i$th $X$ observation is paired with the $i$th $Y$ observation.
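
The pooled interval of Eq. (9) and the paired interval of Eq. (10) can be compared on the same numbers with a short sketch. The function names and the data are hypothetical, and the $t$ quantiles are passed in as ordinary table values (2.4469 for 6 degrees of freedom and 3.1824 for 3 degrees of freedom at the .975 level) instead of being computed.

```python
import math
from statistics import mean, variance

def pooled_interval(x, y, t_quantile):
    """Eq. (9): two independent samples with a common variance."""
    m, n = len(x), len(y)
    # Pooled estimator S_p^2 of the common variance.
    sp2 = ((m - 1) * variance(x) + (n - 1) * variance(y)) / (m + n - 2)
    half = t_quantile * math.sqrt((1 / m + 1 / n) * sp2)
    center = mean(y) - mean(x)
    return center - half, center + half

def paired_interval(x, y, t_quantile):
    """Eq. (10): work with the differences D_i = Y_i - X_i."""
    d = [yi - xi for xi, yi in zip(x, y)]
    n = len(d)
    half = t_quantile * math.sqrt(variance(d) / n)   # sqrt(sum (D_i - Dbar)^2 / (n(n-1)))
    return mean(d) - half, mean(d) + half

x = [10.0, 12.0, 11.0, 13.0]   # hypothetical first measurements
y = [12.0, 13.5, 12.5, 15.0]   # hypothetical second measurements

pooled = pooled_interval(x, y, t_quantile=2.4469)   # .975 quantile, m + n - 2 = 6 df
paired = paired_interval(x, y, t_quantile=3.1824)   # .975 quantile, n - 1 = 3 df
```

With strongly correlated pairs, as in this made-up data, the paired interval is much narrower even though it uses fewer degrees of freedom. Which formula applies depends on how the data were collected, not on which interval is shorter.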


4 METHODS OF FINDING CONFIDENCE INTERVALS 

In this section we will discuss two methods of obtaining a confidence interval. 
A third method will be presented in Chap. IX. 


4.1 Pivotal-quantity Method 

We described the pivotal-quantity method of finding confidence intervals in 
Subsec. 2.3, but we left unanswered the question of whether or not a pivotal 
quantity would actually exist for a given problem. The following remark gives 
a partial answer to this question. 


Remark If $X_1, \ldots, X_n$ is a random sample from $f(\cdot\,; \theta)$, for which the
corresponding cumulative distribution function $F(x; \theta)$ is continuous
in $x$, then, by the probability integral transform, $F(X_i; \theta)$ has a uniform
distribution over the interval (0, 1). Hence $-\log F(X_i; \theta)$ has the density
$e^{-u}I_{(0,\infty)}(u)$ since $P[-\log F(X_i; \theta) > u] = P[\log F(X_i; \theta) < -u] =
P[F(X_i; \theta) < e^{-u}] = e^{-u}$ for $u > 0$. Finally, $-\sum_{i=1}^{n} \log F(X_i; \theta)$ has a gamma
distribution with parameters $n$ and 1; that is,

$$P\left[-\log q_2 < -\sum_{i=1}^{n} \log F(X_i; \theta) < -\log q_1\right] = \int_{-\log q_2}^{-\log q_1} \frac{1}{\Gamma(n)} z^{n-1}e^{-z}\,dz$$

$$= P\left[q_1 < \prod_{i=1}^{n} F(X_i; \theta) < q_2\right] \quad\text{for } 0 < q_1 < q_2 < 1. \tag{11}$$

So

$$\prod_{i=1}^{n} F(X_i; \theta), \quad\text{or}\quad -\sum_{i=1}^{n} \log F(X_i; \theta),$$

is a pivotal quantity. ////

The remark shows that any time that we sample from a population having
a continuous cumulative distribution function, a pivotal quantity exists. Note
that as of yet we have no assurance that the pivotal quantity exhibited by the
remark can be utilized to find a confidence interval. If, however, $F(x; \theta)$ is
monotone in $\theta$ for each $x$, then $\prod_{i=1}^{n} F(x_i; \theta)$ is also monotone in $\theta$ for each
$x_1, \ldots, x_n$, and such monotonicity allows one to find a confidence interval for $\theta$.
We see from Fig. 8 that $q_1 < \prod F(x_i; \theta) < q_2$ if and only if
$t_1(x_1, \ldots, x_n) < \theta < t_2(x_1, \ldots, x_n)$, where $t_1$ and $t_2$ are defined as indicated.


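EXAMPLE 4 Let $X_1, \ldots, X_n$ be a random sample from the density
$f(x; \theta) = \theta x^{\theta - 1}I_{(0,1)}(x)$; then $F(x; \theta) = x^{\theta}I_{(0,1)}(x) + I_{[1,\infty)}(x)$. If $q_1$ and $q_2$ are
selected [see Eq. (11)] so that

$$\gamma = P\left[q_1 < \prod F(X_i; \theta) < q_2\right]$$

$$= P\left[q_1 < \prod X_i^{\theta} < q_2\right]$$

$$= P\left[\log q_1 < \theta \log \prod X_i < \log q_2\right]$$

$$= P\left[-\log q_2 < -\theta \log \prod X_i < -\log q_1\right],$$

then

$$\left(\frac{\log q_2}{\log \prod X_i},\ \frac{\log q_1}{\log \prod X_i}\right)$$

is a $100\gamma$ percent confidence interval for $\theta$. ////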
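
Example 4 can be sketched numerically as follows. Since $n$ is an integer, the gamma distribution with parameters $n$ and 1 has a closed-form distribution function, which the sketch inverts by bisection. The function names, the equal-tails choice of $q_1$ and $q_2$, and the data are assumptions made for the illustration; the text allows any $q_1$ and $q_2$ satisfying Eq. (11).

```python
import math

def erlang_cdf(z, n):
    """P[G <= z] for G gamma-distributed with parameters n (an integer) and 1."""
    if z <= 0:
        return 0.0
    term = total = 1.0
    for k in range(1, n):
        term *= z / k          # z**k / k!
        total += term
    return 1.0 - math.exp(-z) * total

def erlang_quantile(p, n):
    """Quantile for 0 < p < 1, located by bisection on erlang_cdf."""
    lo, hi = 0.0, 1.0
    while erlang_cdf(hi, n) < p:
        hi *= 2
    for _ in range(80):
        mid = (lo + hi) / 2
        if erlang_cdf(mid, n) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def power_density_interval(sample, gamma):
    """Interval (log q2 / log prod x_i, log q1 / log prod x_i) with equal tails."""
    n = len(sample)
    log_prod = sum(math.log(x) for x in sample)   # negative, since every x lies in (0, 1)
    a = erlang_quantile((1 - gamma) / 2, n)       # a = -log q2
    b = erlang_quantile((1 + gamma) / 2, n)       # b = -log q1
    return a / -log_prod, b / -log_prod

data = [0.90, 0.80, 0.95, 0.70, 0.85]             # hypothetical observations in (0, 1)
low, high = power_density_interval(data, 0.95)
```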


We conclude this subsection with two further comments regarding the
existence of pivotal quantities. First, if $\theta$ is a location parameter, then $X_i - \theta$
has a distribution independent of $\theta$ by definition and, hence, is a pivotal quantity,
as are a variety of other random quantities, including $\sum (X_i - \theta)$,
$Y_j - \theta$, etc. Second, if $\theta$ is a scale parameter, then by definition $X_i/\theta$ is
distributed independently of $\theta$ and, hence, is a pivotal quantity, as are $\sum X_i/\theta$,
$Y_j/\theta$, etc.


4.2 Statistical Method 

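As usual, we assume that we have a random sample $X_1, \ldots, X_n$ from the density
$f(\cdot\,; \theta_0)$. We further assume that the parameter $\theta_0$ is real and that the parameter
space is some interval. (In this subsection $\theta_0$ denotes the true parameter value.)
We seek an interval estimate of $\theta_0$ itself. The method is based on a statistic
$T = t(X_1, \ldots, X_n)$. For instance, if a sufficient statistic (unidimensional) exists,
$T$ could be taken to be a sufficient statistic; or, if no such sufficient statistic
exists, $T$ could be taken to be some point estimator of $\theta_0$. The statistic should be
chosen so that the operations that need to be performed to obtain the confidence
interval can be performed. One of those operations will be the determination of
the density of $T$.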




Let $f_T(t; \theta)$ denote the density of $T$. We will proceed as though $T$ is a
continuous random variable, although the technique will also work for $T$ as a
discrete random variable. We can define two functions, say $h_1(\theta)$ and $h_2(\theta)$, as
follows:

$$\int_{-\infty}^{h_1(\theta)} f_T(t; \theta)\,dt = p_1 \quad\text{and}\quad \int_{h_2(\theta)}^{\infty} f_T(t; \theta)\,dt = p_2, \tag{12}$$

where $p_1$ and $p_2$ are two fixed numbers satisfying $0 < p_1$, $0 < p_2$, and $p_1 + p_2 < 1$.
See Fig. 9.

$h_1(\theta)$ and $h_2(\theta)$ can be plotted as functions of $\theta$. We will assume that both
$h_1(\cdot)$ and $h_2(\cdot)$ are strictly monotone, and for our sketch we will assume that
they are monotone, increasing functions. We know that $h_1(\theta) < h_2(\theta)$. See
Fig. 10.

Let $t_0$ denote an observed value of $T$; that is, $t_0 = t(x_1, \ldots, x_n)$ for an
observed random sample $x_1, \ldots, x_n$. Plot the value of $t_0$ on the vertical axis in
Fig. 10, and then find $v_1$ and $v_2$ as indicated. For any possible value of $t_0$, a
corresponding $v_1$ and $v_2$ can be obtained; so $v_1$ and $v_2$ are functions of $t_0$; denote
these by $v_1 = v_1(t_0)$ and $v_2 = v_2(t_0)$. The interval $(V_1, V_2)$ will turn out to be a
$100(1 - p_1 - p_2)$ percent confidence interval for $\theta_0$. To argue that this is so,
let us repeat Fig. 10 as Fig. 11 and add to it. (Figure 10 indicates the method of
finding the confidence interval.)

FIGURE 10

We see from Fig. 11 that $h_1(\theta_0) < t_0 = t(x_1, \ldots, x_n) < h_2(\theta_0)$ if and only if
$v_1 = v_1(x_1, \ldots, x_n) < \theta_0 < v_2 = v_2(x_1, \ldots, x_n)$ for any possible observed sample
$(x_1, \ldots, x_n)$. But by definition of $h_1(\cdot)$ and $h_2(\cdot)$,

$$P_{\theta_0}[h_1(\theta_0) < t(X_1, \ldots, X_n) < h_2(\theta_0)] = 1 - p_1 - p_2;$$

so

$$P_{\theta_0}[v_1(X_1, \ldots, X_n) < \theta_0 < v_2(X_1, \ldots, X_n)] = 1 - p_1 - p_2;$$

that is, as stated, $(V_1, V_2)$ is a $100(1 - p_1 - p_2)$ percent confidence interval for
$\theta_0$, where $V_i = v_i(X_1, \ldots, X_n)$ for $i = 1, 2$.

We might note that the above procedure would work even if $h_1(\cdot)$ and
$h_2(\cdot)$ were not monotone functions, only then we would obtain a confidence
region (often in the form of a set of intervals) instead of a confidence interval.


EXAMPLE 5 Let $X_1, \ldots, X_n$ be a random sample from the density
$f(x; \theta_0) = (1/\theta_0)I_{(0,\theta_0)}(x)$. We want a confidence interval for $\theta_0$.
$Y_n = \max[X_1, \ldots, X_n]$ is known to be a sufficient statistic; it is also the maximum-
likelihood estimator of $\theta_0$. We will use $Y_n$ as our statistic $T$ that appears
in the above discussion; then

$$f_T(t; \theta) = n\left(\frac{t}{\theta}\right)^{n-1}\frac{1}{\theta}I_{(0,\theta)}(t).$$

For given $p_1$ and $p_2$, find $h_1(\theta)$ and $h_2(\theta)$. $p_1 = \int_0^{h_1(\theta)} nt^{n-1}\theta^{-n}\,dt$ implies
that $\int_0^{h_1(\theta)} t^{n-1}\,dt = \theta^n p_1/n$, which in turn implies $[h_1(\theta)]^n/n = \theta^n p_1/n$, or
finally $h_1(\theta) = \theta p_1^{1/n}$. Similarly, $p_2 = \int_{h_2(\theta)}^{\theta} nt^{n-1}\theta^{-n}\,dt$ implies that
$\theta^n - [h_2(\theta)]^n = \theta^n p_2$, or $h_2(\theta) = \theta(1 - p_2)^{1/n}$. See Fig. 12, which is Fig. 10
for the example at hand.

FIGURE 12

For observed $t_0 = \max[x_1, \ldots, x_n]$, $v_1$ is such that $h_2(v_1) = t_0$;
that is, $h_2(v_1) = v_1(1 - p_2)^{1/n} = t_0$, or $v_1 = t_0(1 - p_2)^{-1/n}$. Similarly,
$v_2 = t_0 p_1^{-1/n}$. So a $100(1 - p_1 - p_2)$ percent confidence interval for $\theta_0$ is
given by $(Y_n(1 - p_2)^{-1/n},\ Y_n p_1^{-1/n})$. We could worry about selecting $p_1$
and $p_2$ so that the confidence interval is shortest subject to the restriction
that $1 - p_1 - p_2 = \gamma$. The length of the confidence interval is

$$Y_n[p_1^{-1/n} - (1 - p_2)^{-1/n}],$$

and so the length will be shortest if $p_1$ and $p_2$ are picked so as to minimize
$p_1^{-1/n} - (1 - p_2)^{-1/n}$ subject to $1 - p_1 - p_2 = \gamma$ and $0 < p_1 + p_2 < 1$,
which is accomplished by picking $p_2 = 0$ and $p_1 = 1 - \gamma$.

We might note that $Y_n/\theta$ is a pivotal quantity and a confidence
interval for $\theta$ can be obtained more easily using the pivotal-quantity
method. ////
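
The interval of Example 5 is simple enough to evaluate directly. In the sketch below the function name `uniform_interval` and the data are hypothetical; the two calls compare the choice $p_2 = 0$, $p_1 = 1 - \gamma$ with an equal split of $1 - \gamma$ between $p_1$ and $p_2$.

```python
def uniform_interval(sample, p1, p2):
    """(Y_n (1 - p2)^(-1/n), Y_n p1^(-1/n)) for the upper end point of a uniform density on (0, theta)."""
    n = len(sample)
    y_n = max(sample)
    return y_n * (1 - p2) ** (-1 / n), y_n * p1 ** (-1 / n)

data = [2.1, 4.7, 0.9, 3.8, 4.2]   # hypothetical observations
gamma = 0.90

shortest = uniform_interval(data, p1=1 - gamma, p2=0.0)   # the choice singled out in the example
equal_split = uniform_interval(data, p1=0.05, p2=0.05)    # same coefficient, longer interval
```

With $p_2 = 0$ the lower end point is the sample maximum itself, which is sensible because $\theta_0$ cannot be smaller than any observation.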


We observe in the example above, and in general for that matter, that
$h_1(\theta)$ and $h_2(\theta)$ are really not needed. For a given observed value
$t_0 = t(x_1, \ldots, x_n)$ of the statistic $T$, we need to find $v_1 = v_1(x_1, \ldots, x_n)$ and
$v_2 = v_2(x_1, \ldots, x_n)$. $v_2$ can be found by solving for $\theta$ in the equation

$$p_1 = \int_{-\infty}^{t_0} f_T(t; \theta)\,dt; \tag{13}$$

$v_2$ is the solution. $v_1$ can be found by solving for $\theta$ in the equation

$$p_2 = \int_{t_0}^{\infty} f_T(t; \theta)\,dt; \tag{14}$$

$v_1$ is the solution.

We mentioned at the outset that the method would work for discrete
random variables as well as for continuous random variables. Then the
integrals in Eqs. (12) to (14) would need to be replaced by summations. Two
popular discrete density functions are the Bernoulli and Poisson. One could be
interested in confidence-interval estimates of the parameters in each. In
Example 6 to follow we will consider the Bernoulli density function; the Poisson
case is left as an exercise.


EXAMPLE 6 Let $X_1, \ldots, X_n$ be a random sample from the Bernoulli density;
that is, $P[X = 1] = \theta_0 = 1 - P[X = 0]$. We know that $T = \sum X_i$ is a
sufficient statistic; furthermore, $T$ has a binomial distribution; that is,
$P[T = t] = \binom{n}{t}\theta_0^t(1 - \theta_0)^{n-t}$ for $t = 0, 1, \ldots, n$. We want a confidence-
interval estimate for $\theta_0$. Suppose we observe $T = t_0$ (necessarily an
integer). According to Eqs. (13) and (14) we need to solve for $\theta$ in each of
the equations

$$p_1 = \sum_{t=0}^{t_0} \binom{n}{t}\theta^t(1 - \theta)^{n-t}$$

and

$$p_2 = \sum_{t=t_0}^{n} \binom{n}{t}\theta^t(1 - \theta)^{n-t}.$$

To actually solve these equations, a table of the binomial distribution is
useful. If $p_1 = .0509$, $p_2 = .0159$, $n = 20$, and if $T = 4$ is observed, then
the 93.32 percent confidence interval (.05, .40) for $\theta_0$ is obtained. ////
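
In place of a binomial table, the two equations of Example 6 can be solved by bisection, as in the sketch below. The helper names are hypothetical and bisection is simply one convenient root finder. It relies on the fact that $P[T \le t; \theta]$ decreases as $\theta$ increases.

```python
from math import comb

def binom_cdf(t, n, theta):
    """P[T <= t] for T binomial with parameters n and theta."""
    return sum(comb(n, k) * theta ** k * (1 - theta) ** (n - k) for k in range(t + 1))

def solve_decreasing(func, target, lo=0.0, hi=1.0):
    """Bisection for func(theta) = target when func is decreasing in theta."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if func(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def bernoulli_interval(t0, n, p1, p2):
    """Solve the discrete versions of Eqs. (13) and (14) for v2 and v1."""
    # v2 solves p1 = P[T <= t0; theta].
    v2 = solve_decreasing(lambda th: binom_cdf(t0, n, th), p1)
    # v1 solves p2 = P[T >= t0; theta], that is, P[T <= t0 - 1; theta] = 1 - p2.
    v1 = solve_decreasing(lambda th: binom_cdf(t0 - 1, n, th), 1 - p2)
    return v1, v2

# The numbers used in Example 6.
v1, v2 = bernoulli_interval(t0=4, n=20, p1=0.0509, p2=0.0159)
```

The computed end points agree with the interval (.05, .40) quoted in the example to about three decimal places.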


5 LARGE-SAMPLE CONFIDENCE INTERVALS 

We have seen in our studies of point estimation that it is sometimes possible to
find a sequence of estimators, say $T_n = t_n(X_1, \ldots, X_n)$, of $\theta$ in a density $f(\cdot\,; \theta)$
that are asymptotically normally distributed about $\theta$; that is, $T_n$ is approximately
normally distributed with mean $\theta$ and variance, say, $\sigma_n^2(\theta)$, where $\sigma_n^2(\theta)$ indicates
that the variance is a function of $\theta$ (since it will ordinarily depend on $\theta$) and the
sample size $n$. In particular, we have seen in Sec. 9 of Chap. VII that for large
samples the maximum-likelihood estimator, say $\hat{\Theta}_n = \hat{\theta}_n(X_1, \ldots, X_n)$, for a
parameter $\theta$ in a density $f(\cdot\,; \theta)$ is approximately normally distributed about $\theta$
under rather general conditions. The large-sample variance of the maximum-
likelihood estimator was seen to be, say,

$$\sigma_n^2(\theta) = \frac{1}{n\mathscr{E}_\theta[\{(\partial/\partial\theta) \log f(X; \theta)\}^2]} = \frac{-1}{n\mathscr{E}_\theta[(\partial^2/\partial\theta^2) \log f(X; \theta)]}. \tag{15}$$

When such a sequence of asymptotically normally distributed estimators $\{T_n\}$ of
$\theta$ exists, it is sometimes possible to obtain approximate confidence intervals
quite easily. $(T_n - \theta)/\sigma_n(\theta)$ can be treated as an approximate pivotal quantity,
and, therefore, for large sample size a confidence interval with an approximate
confidence coefficient $\gamma$ may be determined by converting the inequalities

$$-z < \frac{T_n - \theta}{\sigma_n(\theta)} < z, \tag{16}$$

where $z = z_{(1+\gamma)/2}$ is defined by $\Phi(z_{(1+\gamma)/2}) = (1 + \gamma)/2$ or $\Phi(z) - \Phi(-z) = \gamma$.
The above-described method will always work to find a large-sample confidence
interval provided the inequality $-z < (T_n - \theta)/\sigma_n(\theta) < z$ can be inverted.


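EXAMPLE 7 Let $X_1, \ldots, X_n$ be a random sample from the density
$f(x; \theta) = \theta e^{-\theta x}I_{(0,\infty)}(x)$. We know (see Example 51 in Chap. VII) that the
maximum-likelihood estimator of $\theta$, which is $1/\bar{X}_n$, has an asymptotic normal
distribution with mean $\theta$ and variance equal to

$$\frac{-1}{n\mathscr{E}_\theta[(\partial^2/\partial\theta^2) \log f(X; \theta)]} = \frac{\theta^2}{n}.$$

Therefore, $\sigma_n^2(\theta) = \theta^2/n$, and hence

$$\gamma \approx P\left[-z < \frac{1/\bar{X}_n - \theta}{\theta/\sqrt{n}} < z\right] = P\left[\frac{1/\bar{X}_n}{1 + z/\sqrt{n}} < \theta < \frac{1/\bar{X}_n}{1 - z/\sqrt{n}}\right];$$

so

$$\left(\frac{1/\bar{X}_n}{1 + z/\sqrt{n}},\ \frac{1/\bar{X}_n}{1 - z/\sqrt{n}}\right)$$

is a large-sample confidence interval for $\theta$ with an approximate confidence
coefficient $\gamma$, where $z$ is given by $\Phi(z) - \Phi(-z) = \gamma$. The inversion requires
$1 - z/\sqrt{n} > 0$, which holds for large $n$. ////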
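
A minimal sketch of the interval in Example 7 follows. The function name and the data are hypothetical, and the sample is kept short only for readability, so the normal approximation would be rough for it. `statistics.NormalDist` supplies $z$.

```python
import math
from statistics import NormalDist, mean

def exponential_rate_interval(sample, gamma):
    """((1/xbar)/(1 + z/sqrt(n)), (1/xbar)/(1 - z/sqrt(n))) for the rate theta."""
    n = len(sample)
    z = NormalDist().inv_cdf((1 + gamma) / 2)     # Phi(z) - Phi(-z) = gamma
    if z >= math.sqrt(n):
        raise ValueError("the inequality cannot be inverted unless 1 - z/sqrt(n) > 0")
    mle = 1 / mean(sample)                        # maximum-likelihood estimate of theta
    return mle / (1 + z / math.sqrt(n)), mle / (1 - z / math.sqrt(n))

# Hypothetical positive observations, assumed to be exponential.
data = [0.5, 1.2, 0.3, 2.1, 0.8, 0.4, 1.7, 0.9, 0.2, 1.1, 0.6, 1.4, 0.7, 2.5, 0.3, 1.0]
low, high = exponential_rate_interval(data, 0.95)
```

Note that the interval is not symmetric about $1/\bar{x}_n$, because $\sigma_n(\theta)$ depends on $\theta$.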


EXAMPLE 8 Consider sampling from the Bernoulli distribution with parameter
$\theta = P[X = 1] = 1 - P[X = 0]$. The maximum-likelihood estimator
of $\theta$ is $\hat{\Theta} = \bar{X}_n$, and it has variance $\sigma_n^2(\theta) = \theta(1 - \theta)/n$. An approximate
$100\gamma$ percent confidence interval for $\theta$ is obtained by converting the
inequalities in

$$P\left[-z < \frac{\hat{\Theta} - \theta}{\sqrt{\theta(1 - \theta)/n}} < z\right] \approx \gamma$$

to get

$$P\left[\frac{2n\hat{\Theta} + z^2 - z\sqrt{4n\hat{\Theta} + z^2 - 4n\hat{\Theta}^2}}{2(n + z^2)} < \theta < \frac{2n\hat{\Theta} + z^2 + z\sqrt{4n\hat{\Theta} + z^2 - 4n\hat{\Theta}^2}}{2(n + z^2)}\right] \approx \gamma. \tag{17}$$

These expressions for the limits may be simplified since in deriving the
large-sample distribution certain terms containing the factor $1/\sqrt{n}$ are
neglected; that is, the asymptotic normal distribution is correct only to
within error terms of size a constant times $1/\sqrt{n}$. We may therefore
neglect terms of this order in the limits in Eq. (17) without appreciably
affecting the accuracy of the approximation. This means simply that we
may omit all the $z^2$ terms in Eq. (17) because they always occur added to a
term with factor $n$ and will be negligible relative to $n$ when $n$ is large to
within the degree of approximation that we are assuming. Thus Eq. (17)
may be rewritten as

$$P\left[\hat{\Theta} - z\sqrt{\frac{\hat{\Theta}(1 - \hat{\Theta})}{n}} < \theta < \hat{\Theta} + z\sqrt{\frac{\hat{\Theta}(1 - \hat{\Theta})}{n}}\right] \approx \gamma. \tag{18}$$

In particular,

$$P\left[\hat{\Theta} - 1.96\sqrt{\frac{\hat{\Theta}(1 - \hat{\Theta})}{n}} < \theta < \hat{\Theta} + 1.96\sqrt{\frac{\hat{\Theta}(1 - \hat{\Theta})}{n}}\right] \approx .95$$

gives an approximate 95 percent confidence interval for $\theta$ for large
samples. ////
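
The limits in Eqs. (17) and (18) can be compared numerically with the sketch below. The function names are hypothetical, and the values $\hat{\theta} = .3$, $n = 100$ are made up for the illustration.

```python
import math
from statistics import NormalDist

def bernoulli_interval_eq17(theta_hat, n, gamma):
    """Limits of Eq. (17): exact inversion of the quadratic inequality in theta."""
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    root = z * math.sqrt(4 * n * theta_hat + z * z - 4 * n * theta_hat ** 2)
    center = 2 * n * theta_hat + z * z
    denom = 2 * (n + z * z)
    return (center - root) / denom, (center + root) / denom

def bernoulli_interval_eq18(theta_hat, n, gamma):
    """Limits of Eq. (18): the z**2 terms of Eq. (17) dropped."""
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    half = z * math.sqrt(theta_hat * (1 - theta_hat) / n)
    return theta_hat - half, theta_hat + half

theta_hat, n, gamma = 0.3, 100, 0.95   # hypothetical observed proportion and sample size
exact = bernoulli_interval_eq17(theta_hat, n, gamma)
simplified = bernoulli_interval_eq18(theta_hat, n, gamma)
```

The two sets of limits differ by roughly .01 here, and the difference shrinks as $n$ grows, which is the sense in which the $z^2$ terms are negligible.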


We may observe that Eq. (18) is just the expression that would have been
obtained had $\hat{\Theta}$ been substituted for $\theta$ in $\sigma_n^2(\theta)$. The substitution would imply
that

$$\frac{\hat{\Theta} - \theta}{\sqrt{\hat{\Theta}(1 - \hat{\Theta})/n}}$$

is approximately normally distributed with mean 0 and unit variance. It is,
in fact, true in general that in the asymptotic normal distribution of a maximum-
likelihood estimator $\hat{\Theta}$ the variance $\sigma_n^2(\theta)$ may be replaced by its estimator $\sigma_n^2(\hat{\Theta})$
without appreciably affecting the accuracy of the approximation. We shall not
prove this fact but shall use it because it greatly simplifies the conversion of the
inequalities that is required to get the large-sample confidence intervals. For
instance,

$$P\left[-z < \frac{\hat{\Theta} - \theta}{\sigma_n(\hat{\Theta})} < z\right] \approx \gamma$$

readily converts to give

$$P[\hat{\Theta} - z\sigma_n(\hat{\Theta}) < \theta < \hat{\Theta} + z\sigma_n(\hat{\Theta})] \approx \gamma, \tag{19}$$

where $\hat{\Theta}$ is the asymptotically normally distributed maximum-likelihood
estimator of $\theta$, $\sigma_n(\hat{\Theta})$ in this expression is the maximum-likelihood estimator of
$\sigma_n(\theta)$ (which is the large-sample standard deviation of $\hat{\Theta}$), and $z$ is given by
$\Phi(z) - \Phi(-z) = \gamma$.

We noted in Sec. 9 of Chap. VII that under regularity conditions the joint
distribution of the maximum-likelihood estimators of the components of a
$k$-dimensional parameter is asymptotically normal. Although we
will not so argue, such a result could be used to obtain a large-sample confidence
region.

The large-sample confidence intervals presented in this section have an
optimum property which we shall point out but not prove. Recall that in the
earlier sections, particularly Sec. 3, of this chapter we were concerned with finding
the shortest interval for a given probability. Loosely speaking, an analogous
optimum property of large-sample confidence intervals based on maximum-
likelihood estimators is this: Large-sample confidence intervals based on the
maximum-likelihood estimator will be shorter, on the average, than intervals
determined by any other estimator.


6 BAYESIAN INTERVAL ESTIMATES 

In Sec. 7 of Chap. VII we examined what is called Bayes estimation. There
we assumed that a random sample, say $X_1, \ldots, X_n$, from some density $f(\cdot\,; \theta) =
f(\cdot \mid \theta)$ was available, where the form of the function $f(\cdot \mid \cdot)$ was known and the
fixed value of $\theta$ was unknown. We further assumed that the unknown fixed
value of $\theta$ was the value of a random variable $\Theta$ with known density, denoted by
$g_\Theta(\cdot)$ and called the prior density of $\Theta$. We then used this additional knowledge
of a known prior density to define the posterior distribution of $\Theta$, and from this
posterior distribution we defined the posterior Bayes point estimator of $\theta$. In
this section we use this same posterior distribution of $\Theta$ to arrive at an interval
estimator of $\theta$.

If $f(\cdot \mid \theta)$ is the density sampled from and $g_\Theta(\cdot)$ is the prior density of $\Theta$,
then the posterior density of $\Theta$ given $(X_1, \ldots, X_n) = (x_1, \ldots, x_n)$ is [recall Eq. (19)
of Chap. VII]

$$f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n) = \frac{\left[\prod f(x_i \mid \theta)\right]g_\Theta(\theta)}{\int \left[\prod f(x_i \mid \theta)\right]g_\Theta(\theta)\,d\theta}. \tag{20}$$

For fixed $\gamma$, any interval, say $(t_1, t_2)$, satisfying

$$\int_{t_1}^{t_2} f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)\,d\theta = \gamma \tag{21}$$

is defined to be a $100\gamma$ percent Bayesian interval estimate of $\theta$. In practice, one
would naturally pick those $t_1$ and $t_2$ satisfying Eq. (21) for which $t_2 - t_1$ is smallest.
Note that $t_i = t_i(x_1, \ldots, x_n)$; that is, $t_i$ is some function of the observations
$x_1, \ldots, x_n$.

EXAMPLE 9 Let $X_1, \ldots, X_n$ be a random sample from the normal density
with mean $\theta$ and variance 1. Assume that $\Theta$ has a normal density with
mean $x_0$ and variance 1. Consider estimating $\theta$. We saw in Example 44
of Chap. VII that the posterior distribution of $\Theta$ is normal with mean
$\sum_{i=0}^{n} x_i/(n + 1)$ and variance $1/(n + 1)$. We seek $t_1$ and $t_2$ satisfying

$$\gamma = \int_{t_1}^{t_2} f_{\Theta \mid X_1 = x_1, \ldots, X_n = x_n}(\theta \mid x_1, \ldots, x_n)\,d\theta$$

$$= \Phi\left(\frac{t_2 - \sum_{i=0}^{n} x_i/(n + 1)}{\sqrt{1/(n + 1)}}\right) - \Phi\left(\frac{t_1 - \sum_{i=0}^{n} x_i/(n + 1)}{\sqrt{1/(n + 1)}}\right).$$

If $z$ is such that $\Phi(z) - \Phi(-z) = \gamma$, then

$$\left(\frac{\sum_{i=0}^{n} x_i}{n + 1} - z\sqrt{\frac{1}{n + 1}},\ \frac{\sum_{i=0}^{n} x_i}{n + 1} + z\sqrt{\frac{1}{n + 1}}\right) \tag{22}$$

gives the shortest $100\gamma$ percent Bayesian interval estimate of $\theta$. Note that
the corresponding $100\gamma$ percent confidence-interval estimate of $\theta$ is given
by $\left(\sum_{i=1}^{n} x_i/n - z\sqrt{1/n},\ \sum_{i=1}^{n} x_i/n + z\sqrt{1/n}\right)$. The only difference in the results
of the two methods for this example is that the sample size seems to
increase by 1 and the apparent "additional observation" is the mean of
the assumed prior normal distribution. ////
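
The closing remark of Example 9 can be checked directly: the Bayesian interval of Eq. (22) coincides with the classical interval computed from the sample augmented by the prior mean $x_0$. The function names, the data, and the prior mean in the sketch below are hypothetical.

```python
from statistics import NormalDist

def bayes_interval(sample, prior_mean, gamma):
    """Eq. (22): normal sample with variance 1, normal prior with mean x0 and variance 1."""
    n = len(sample)
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    post_mean = (prior_mean + sum(sample)) / (n + 1)   # sum of x_0, x_1, ..., x_n over n + 1
    half = z * (1 / (n + 1)) ** 0.5
    return post_mean - half, post_mean + half

def classical_interval(sample, gamma):
    """Confidence interval for the mean of a normal density with variance 1."""
    n = len(sample)
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    center = sum(sample) / n
    half = z * (1 / n) ** 0.5
    return center - half, center + half

data = [1.8, 2.4, 1.1, 2.9, 2.0]   # hypothetical observations
x0 = 0.0                           # hypothetical prior mean

bayes = bayes_interval(data, x0, 0.95)
augmented = classical_interval([x0] + data, 0.95)   # prior mean treated as an extra observation
classical = classical_interval(data, 0.95)
```

Here `bayes` and `augmented` agree, while `classical` is centered at the sample mean and is slightly longer.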


PROBLEMS 

1 Let $X$ be a single observation from the density

$$f(x; \theta) = \theta x^{\theta - 1}I_{(0,1)}(x),$$

where $\theta > 0$.

(a) Find a pivotal quantity, and use it to find a confidence-interval estimator of $\theta$.
(b) Show that $(Y/2, Y)$ is a confidence interval for $\theta$. Find its confidence
coefficient. Also, find a better confidence interval for $\theta$. Define $Y = -1/\log X$.

2 Let $X_1, \ldots, X_n$ be a random sample from $N(\theta, \theta)$, $\theta > 0$. Give an example of a
pivotal quantity, and use it to obtain a confidence-interval estimator of $\theta$.

3 Suppose that $T_1$ is a $100\gamma$ percent lower confidence limit for $\tau(\theta)$ and $T_2$ is a $100\gamma$
percent upper confidence limit for $\tau(\theta)$. Further assume that $P_\theta[T_1 < T_2] = 1$.
Find a $100(2\gamma - 1)$ percent confidence interval for $\tau(\theta)$. (Assume $\gamma > \frac{1}{2}$.)

4 Let $X_1, \ldots, X_n$ denote a random sample from $f(x; \theta) = I_{(\theta - 1/2,\ \theta + 1/2)}(x)$. Let
$Y_1 < \cdots < Y_n$ be the corresponding ordered sample. Show that $(Y_1, Y_n)$ is a
confidence interval for $\theta$. Find its confidence coefficient.

5 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \theta e^{-\theta x}I_{(0,\infty)}(x)$.

(a) Find a $100\gamma$ percent confidence interval for the mean of the population.

(b) Do the same for the variance of the population.

(c) What is the probability that these intervals cover the true mean and true
variance simultaneously?

(d) Find a confidence-interval estimator of $e^{-\theta} = P[X > 1]$.

(e) Find a pivotal quantity based only on $Y_1$, and use it to find a confidence-
interval estimator of $\theta$. ($Y_1 = \min[X_1, \ldots, X_n]$.)

6 $X$ is a single observation from $\theta e^{-\theta x}I_{(0,\infty)}(x)$, where $\theta > 0$.

(a) $(X, 2X)$ is a confidence interval for $1/\theta$. What is its confidence coefficient?
(b) Find another confidence interval for $1/\theta$ that has the same coefficient but
smaller expected length.

7 Let $X_1, X_2$ denote a random sample of size 2 from $N(\theta, 1)$. Let $Y_1 < Y_2$ be the
corresponding ordered sample.

(a) Determine $\gamma$ in $P[Y_1 < \theta < Y_2] = \gamma$. Find the expected length of the interval
$(Y_1, Y_2)$.

(b) Find that confidence-interval estimator for $\theta$ using $\bar{X} - \theta$ as a pivotal quantity
that has a confidence coefficient $\gamma$, and compare the length with the expected
length in part (a).

8 Consider random sampling from a normal distribution with mean $\mu$ and variance
$\sigma^2$.

(a) Derive a confidence-interval estimator of $\mu$ when $\sigma^2$ is known.

(b) Derive a confidence-interval estimator of $\sigma^2$ when $\mu$ is known.

9 Find a 90 percent confidence interval for the mean of a normal distribution with
$\sigma = 3$ given the sample (3.3, -.3, -.6, -.9). What would be the confidence
interval if $\sigma$ were unknown?

10 The breaking strengths in pounds of five specimens of manila rope of diameter
^ inch were found to be 660, 460, 540, 580, and 550. 

(a) Estimate the mean breaking strength by a 95 percent confidence interval
assuming normality.

(b) Estimate the point at which only 5 percent of such specimens would be
expected to break.

(c) Estimate $\sigma^2$ by a 90 percent confidence interval; also $\sigma$.

(d) Plot an 81 percent confidence region for the joint estimation of $\mu$ and $\sigma^2$; for
$\mu$ and $\sigma$.

11 A sample was drawn from each of five populations assumed to be normal with the
same variance. The values of $(n - 1)S^2 = \sum (X_i - \bar{X})^2$ and $n$, the sample size,
were

$(n - 1)S^2$: 40 30 20 42 50

$n$:           6  4  3  7  8

Find 98 percent confidence limits for the common variance.

12 Develop a method for estimating the ratio of variances of two normal populations
by a confidence interval.

13 What is the probability that the length of a $t$ confidence interval for $\mu$ when
sampling from a normal distribution will be less than $\sigma$ for samples of size 20?

14 In sampling from a normal population compare the average length of the two
confidence intervals for the mean $\mu$ when (a) $\sigma$ is known and (b) $\sigma$ is unknown.

15 Show that the length and the variance of the length of the $t$ confidence interval
for $\mu$ when sampling from a normal population approach 0 with increasing sample
size.

16 In sampling from a normal population with both $\mu$ and $\sigma$ unknown, how large a
sample must be drawn to make the probability .95 that a 90 percent confidence
interval for $\mu$ will have length less than $\sigma/5$?

17 Show that the length of the confidence interval for $\sigma$ of a normal population
approaches 0 with increasing sample size.

18 To test two promising new lines of hybrid corn under normal farming conditions, 
a seed company selected eight farms at random in Iowa and planted both lines in 
experimental plots on each farm. The yields (converted to bushels per acre) for 
the eight locations were 

Line A: 86 87 56 93 84 93 75 79 

Line B: 80 79 58 91 77 82 74 66 

Assuming that the two yields are jointly normally distributed, estimate the difference 
between the mean yields by a 95 percent confidence interval. 



IX

TESTS OF HYPOTHESES 


1 INTRODUCTION AND SUMMARY

There are two major areas of statistical inference: the estimation of parameters
and the testing of hypotheses. We shall study the second of these two areas in
this chapter. Our aim will be to develop general methods for testing hypotheses
and to apply those methods to some common problems. The methods developed
will be of further use in later chapters.

In experimental research, the object is sometimes merely to estimate
parameters. Thus one may wish to estimate the yield of a new hybrid line of corn.
But more often the ultimate purpose will involve some use of the estimate.
One may wish, for example, to compare the yield of the new line with that of a
standard line and perhaps recommend that the new line replace the standard
line if it appears superior. This is a common situation in research. One may
wish to determine whether a new method of sealing light bulbs will increase
the life of the bulbs, whether a new germicide is more effective in treating a
certain infection than a standard germicide, whether one method of preserving
foods is better than another insofar as the retention of vitamins is concerned,
and so on.




Using the light bulb example as an illustration, let us suppose that the
average life of bulbs made under a standard manufacturing procedure is 1400
hours. It is desired to test a new procedure for manufacturing the bulbs. The
statistical model here is this: We are dealing with two populations of light bulbs,
those made by the standard process and those made by the proposed process.
We know (from numerous past investigations) that the mean of the first
population is about 1400. The question is whether the mean of the second population
is greater than or less than 1400. Traditionally, to answer this type of question,
we set up the hypothesis that one mean is greater than the other mean. Then,
on the basis of a sample from the population of the proposed process, we shall
either accept or reject the hypothesis.

For our example, we formulate the hypothesis that the proposed process
is no better than the standard process. Generally we hope that the hypothesis
will be rejected. To test the hypothesis, a number of bulbs are made by the new
process and their lives measured. Suppose that the mean of this sample of
observations is 1550 hours. The indication is that the new process is better, but
suppose that the estimate of the standard deviation of the mean, $\hat{\sigma}/\sqrt{n}$, is 125
($n$ being the sample size). Then a 95 percent confidence interval for the mean of
the second population (assuming normality) is roughly 1300 to 1800 hours. The
sample mean 1550 could very easily have come from a population with mean 1400.
We have no strong grounds for rejecting the hypothesis. If, on the other hand,
$\hat{\sigma}/\sqrt{n}$ were 25, then we could very confidently reject the hypothesis and pronounce
the proposed manufacturing process to be superior.
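
The arithmetic behind the two cases can be checked with a short sketch. It assumes the rough rule "mean plus or minus two standard errors" for a 95 percent interval, which is what the word "roughly" in the text suggests; the function name `normal_interval` is ours and is only illustrative.

```python
def normal_interval(mean, std_error, z=2.0):
    """Approximate confidence interval: mean +/- z standard errors.

    z = 2.0 is the rough 95 percent multiplier assumed here.
    """
    return mean - z * std_error, mean + z * std_error

# Standard error 125: the interval is wide and contains 1400.
wide = normal_interval(1550, 125)
# Standard error 25: the interval is narrow and lies entirely above 1400.
narrow = normal_interval(1550, 25)

print(wide, narrow)
```

With a standard error of 125 the value 1400 falls inside the interval, so the data do not contradict the hypothesis; with a standard error of 25 it falls well outside.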

The testing of hypotheses is seen to be closely related to the problem of
estimation. It will be instructive, however, to develop the theory of testing
independently of the theory of estimation, at least in the beginning.

In order to conveniently talk about testing of hypotheses, we need to
introduce some language and notation and give some definitions. As was the case
when we studied estimation, we will assume that we can obtain a random sample
$X_1, \ldots, X_n$ from some density $f(\cdot\,; \theta)$. A statistical hypothesis will be a
hypothesis about the distribution of the population.



Definition 1 Statistical hypothesis A statistical hypothesis is an
assertion or conjecture about the distribution of one or more random variables.
If the statistical hypothesis completely specifies the distribution, then it is
called simple; otherwise, it is called composite. ////

Notation To denote a statistical hypothesis, we will use a script capital
$\mathscr{H}$ followed by a colon that in turn is followed by the assertion that
specifies the hypothesis. ////
EXAMPLE 1 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \phi_{\theta,25}(x)$.
The statistical hypothesis that the mean of this normal population is less
than or equal to 17 is denoted as follows: $\mathscr{H}: \theta \le 17$. Such a hypothesis
is composite; it does not completely specify the distribution. On the
other hand, the hypothesis $\mathscr{H}: \theta = 17$ is simple since it completely specifies
the distribution. ////

Definition 2 Test of a statistical hypothesis A test of a statistical
hypothesis $\mathscr{H}$ is a rule or procedure for deciding whether to reject $\mathscr{H}$. ////

Notation Let us use a capital upsilon $\Upsilon$ to denote a test. ////

EXAMPLE 2 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \phi_{\theta,25}(x)$.
Consider $\mathscr{H}: \theta \le 17$. One possible test $\Upsilon$ is as follows: Reject $\mathscr{H}$ if and
only if $\bar{X} > 17 + 5/\sqrt{n}$. ////

A test can be either randomized or nonrandomized. The test $\Upsilon$ given in
Example 2 above is an example of a nonrandomized test. Another possible
test, say $\Upsilon'$, of $\mathscr{H}$ in Example 2 is the following: Toss a coin, and reject $\mathscr{H}$ if
and only if a head appears. Such is an example of a randomized test. Although
we will make little use of randomized tests in this book, we do include their
definition. Definitions of both nonrandomized and randomized tests follow.

Notation As in previous chapters, we let $\mathfrak{X}$ denote the sample space of
observations, or the potential data set; that is, $\mathfrak{X} = \{(x_1, \ldots, x_n) : (x_1, \ldots, x_n)$
is a possible value of $(X_1, \ldots, X_n)\}$. ////

Definition 3 Nonrandomized test and critical region Let a test $\Upsilon$ of a
statistical hypothesis $\mathscr{H}$ be defined as follows: Reject $\mathscr{H}$ if and only if
$(x_1, \ldots, x_n) \in C_\Upsilon$, where $C_\Upsilon$ is a subset of $\mathfrak{X}$; then $\Upsilon$ is called a
nonrandomized test, and $C_\Upsilon$ is called the critical region of the test $\Upsilon$. ////


EXAMPLE 3 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \phi_{\theta,25}(x)$.
$\mathfrak{X}$ is euclidean $n$ space. Consider $\mathscr{H}: \theta \le 17$ and the test $\Upsilon$: Reject $\mathscr{H}$ if
and only if $\bar{x} > 17 + 5/\sqrt{n}$. Then $\Upsilon$ is nonrandomized, and
$C_\Upsilon = \{(x_1, \ldots, x_n) : \bar{x} > 17 + 5/\sqrt{n}\}$. ////
and

$$C = \Bigl\{(x_1, \ldots, x_{10}) : \sum x_i > 5\Bigr\}.$$

The critical function of test $\Upsilon$ is given by

$$\psi_\Upsilon(x_1, \ldots, x_{10}) =
\begin{cases}
1 & \text{if } (x_1, \ldots, x_{10}) \in C \\
1/2 & \text{if } (x_1, \ldots, x_{10}) \in B \\
0 & \text{if } (x_1, \ldots, x_{10}) \in A.
\end{cases}$$ ////

Remark We saw that a nonrandomized test was specified by its critical
region. Likewise, a randomized test is specified by its critical function.
In fact any function $\psi(\cdot, \ldots, \cdot)$ with domain $\mathfrak{X}$ and counterdomain the
interval [0, 1] is a possible critical function and defines a randomized
test. ////

The following remark shows that a nonrandomized test is a particular case 
of a randomized test. 


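Remark If test $\Upsilon$ has a critical function defined by

$$\psi_\Upsilon(x_1, \ldots, x_n) =
\begin{cases}
1 & \text{if } (x_1, \ldots, x_n) \in C_\Upsilon \\
0 & \text{otherwise,}
\end{cases}$$

then $\Upsilon$ is a nonrandomized test with critical region $C_\Upsilon$. ////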


As we mentioned earlier, we will not make extensive use of randomized 
tests. Theorem 1 below requires their use; other than that, their only use will 
be in obtaining tests of exact size (see Definition 7), and then only for sampling 
from discrete distributions. 

In many hypotheses-testing problems two hypotheses are discussed: The
first, the hypothesis being tested, is called the null hypothesis, denoted by $\mathscr{H}_0$,
and the second is called the alternative hypothesis, denoted by $\mathscr{H}_1$. The
thinking is that if the null hypothesis is false, then the alternative hypothesis is true,
and vice versa. We often say that $\mathscr{H}_0$ is tested against, or versus, $\mathscr{H}_1$. If the
null hypothesis is not rejected, we say that $\mathscr{H}_0$ is accepted. With this kind
of thinking, two types of errors can be made.

Definition 5 Types of error and size of error Rejection of $\mathscr{H}_0$ when it is
true is called a Type I error, and acceptance of $\mathscr{H}_0$ when it is false is called
a Type II error. The size of a Type I error is defined to be the probability
that a Type I error is made, and similarly the size of a Type II error is the
probability that a Type II error is made. ////
If the distribution from which the sample was obtained is parameterized by
$\theta$, where $\theta \in \bar{\Theta}$, then associated with any test is a power function, defined as in
Definition 6.

Definition 6 Power function Let $\Upsilon$ be a test of the null hypothesis $\mathscr{H}_0$.
The power function of the test $\Upsilon$, denoted by $\pi_\Upsilon(\theta)$, is defined to be the
probability that $\mathscr{H}_0$ is rejected when the distribution from which the
sample was obtained was parameterized by $\theta$. ////

The power function will play the same role in hypothesis testing that
mean-squared error played in estimation. It will usually be our standard in assessing
the goodness of a test or in comparing two competing tests. An ideal power
function, of course, is a function that is 0 for those $\theta$ corresponding to the null
hypothesis and is unity for those $\theta$ corresponding to the alternative hypothesis.
The idea is that you do not want to reject $\mathscr{H}_0$ if $\mathscr{H}_0$ is true and you do want to
reject $\mathscr{H}_0$ when $\mathscr{H}_0$ is false.

Remark $\pi_\Upsilon(\theta) = P_\theta[\text{reject } \mathscr{H}_0]$, where $\theta$ is the true value of the
parameter. If $\Upsilon$ is a nonrandomized test, then $\pi_\Upsilon(\theta) = P_\theta[(X_1, \ldots, X_n) \in C_\Upsilon]$,
where $C_\Upsilon$ is the critical region associated with test $\Upsilon$. If $\Upsilon$ is a randomized
test with critical function $\psi_\Upsilon(\cdot, \ldots, \cdot)$, then

$$\pi_\Upsilon(\theta) = P_\theta[\text{reject } \mathscr{H}_0]
= \int \cdots \int P[\text{reject } \mathscr{H}_0 \mid x_1, \ldots, x_n]\, f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \theta) \prod_{i=1}^{n} dx_i$$

$$= \int \cdots \int \psi_\Upsilon(x_1, \ldots, x_n)\, f_{X_1, \ldots, X_n}(x_1, \ldots, x_n; \theta) \prod_{i=1}^{n} dx_i
= \mathscr{E}_\theta[\psi_\Upsilon(X_1, \ldots, X_n)].$$

The argument is similar for discrete random variables. ////


EXAMPLE 5 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \phi_{\theta,25}(x)$.
Consider $\mathscr{H}_0: \theta \le 17$ and the test $\Upsilon$: Reject $\mathscr{H}_0$ if and only if $\bar{X} > 17 + 5/\sqrt{n}$.

$$\pi_\Upsilon(\theta) = P_\theta\bigl[\bar{X} > 17 + 5/\sqrt{n}\bigr]
= 1 - \Phi\!\left(\frac{17 + 5/\sqrt{n} - \theta}{5/\sqrt{n}}\right).$$

For $n = 25$, $\pi_\Upsilon(\theta)$ is sketched in Fig. 1.

The power function is useful in telling how good a particular test is.
In this example, if $\theta$ is greater than about 20, the test $\Upsilon$ is almost certain
to reject $\mathscr{H}_0$, as it should. And if $\theta$ is less than about 16, the test $\Upsilon$ is
almost certain not to reject $\mathscr{H}_0$, as it should. On the other hand, if
$17 < \theta < 18$ (so $\mathscr{H}_0$ is false), the test $\Upsilon$ has less than half a chance of
rejecting $\mathscr{H}_0$. ////
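
The power function of Example 5 is easy to tabulate. The following is a minimal sketch, assuming the normal model with variance 25 and the cutoff $17 + 5/\sqrt{n}$ used in the example; the function names `std_normal_cdf` and `power` are ours, and the standard normal distribution function is computed from the error function in the standard library.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power(theta, n=25, sigma=5.0, theta0=17.0):
    """Probability that the sample mean exceeds theta0 + sigma/sqrt(n)
    when the true mean is theta."""
    std_error = sigma / sqrt(n)
    cutoff = theta0 + std_error
    return 1.0 - std_normal_cdf((cutoff - theta) / std_error)

# Tabulate the power at a few values of theta for n = 25.
for theta in (16, 17, 17.5, 18, 20):
    print(theta, round(power(theta), 4))
```

For $n = 25$ the cutoff is 18, so the power is exactly one-half at $\theta = 18$, is small below 16, and is close to unity above 20, in agreement with the discussion of the example.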

Definition 7 Size of test Let $\Upsilon$ be a test of the hypothesis $\mathscr{H}_0: \theta \in \bar{\Theta}_0$,
where $\bar{\Theta}_0 \subset \bar{\Theta}$; that is, $\bar{\Theta}_0$ is a subset of the parameter space $\bar{\Theta}$. The size
of the test $\Upsilon$ of $\mathscr{H}_0$ is defined to be $\sup_{\theta \in \bar{\Theta}_0} [\pi_\Upsilon(\theta)]$. The size of the test for
a nonrandomized test is also referred to as the size of the critical region. ////

Remark Many writers use the terms "significance level" and "size of
test" interchangeably. We, however, will avoid use of the term
"significance level," intending to reserve its use for tests of significance, a type of
statistical inference that is closely related to hypothesis testing. Tests of
significance will not be considered in this book; the interested reader is
referred to Ref. [37]. ////

EXAMPLE 6 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \phi_{\theta,25}(x)$.
Consider $\mathscr{H}_0: \theta \le 17$ and the test $\Upsilon$: Reject $\mathscr{H}_0$ if $\bar{X} > 17 + 5/\sqrt{n}$.
$\bar{\Theta}_0 = \{\theta : \theta \le 17\}$, and the size of the test $\Upsilon$ is

$$\sup_{\theta \in \bar{\Theta}_0} [\pi_\Upsilon(\theta)]
= \sup_{\theta \le 17} \left[1 - \Phi\!\left(\frac{17 + 5/\sqrt{n} - \theta}{5/\sqrt{n}}\right)\right]
= 1 - \Phi(1) \approx .159.$$ ////

In our study of point estimation, we found that for certain considerations
we could restrict attention to estimators that were functions of sufficient statistics
only. The same is true for testing hypotheses when the power function is used
as a basis of comparing tests, as the following theorem shows.

Theorem 1 If $X_1, \ldots, X_n$ is a random sample from $f(\cdot\,; \theta)$, where $\theta \in \bar{\Theta}$,
and $S_1 = s_1(X_1, \ldots, X_n), \ldots, S_r = s_r(X_1, \ldots, X_n)$ is a set of sufficient
statistics, then for any test $\Upsilon$ with critical function $\psi_\Upsilon$, there exists a test, say
$\Upsilon'$, and corresponding critical function, say $\psi_{\Upsilon'}$, depending only on the set
of sufficient statistics which satisfies $\pi_\Upsilon(\theta) = \pi_{\Upsilon'}(\theta)$ for all $\theta \in \bar{\Theta}$.

PROOF Define $\psi_{\Upsilon'}(s_1, \ldots, s_r) = \mathscr{E}[\psi_\Upsilon(X_1, \ldots, X_n) \mid S_1 = s_1, \ldots, S_r = s_r]$;
then $\psi_{\Upsilon'}$ is a critical function. Furthermore,
$\pi_{\Upsilon'}(\theta) = \mathscr{E}_\theta[\psi_{\Upsilon'}(S_1, \ldots, S_r)]
= \mathscr{E}_\theta[\mathscr{E}[\psi_\Upsilon(X_1, \ldots, X_n) \mid S_1, \ldots, S_r]]
= \mathscr{E}_\theta[\psi_\Upsilon(X_1, \ldots, X_n)] = \pi_\Upsilon(\theta)$. ////

The theorem shows that given any test, another test which depends only on
a set of sufficient statistics can be found, and this new test has a power function
identical to the power function of the original test. So, in our search for good
tests we need only look among tests that depend on sufficient statistics.

We have introduced some of the language of testing in the above. The
problem of testing is like estimation in the sense that it is twofold: First, a
method of finding a test is needed, and, second, some criteria for comparing
competing tests are desirable. Although we will be interested in both aspects
of the problem, we will not discuss them in that order. First we will consider,
in Sec. 2, the problem of testing a simple null hypothesis against a simple
alternative. Two approaches will be assumed: The first will use the power function
as a basis for setting goodness criteria for tests, and the second will use a loss
function. The Neyman-Pearson lemma is stated and proved. It will turn out
that all those tests which are best in some sense will be of the form of a simple
likelihood-ratio test, which is defined in Subsec. 2.1.

Tests of composite hypotheses will be discussed in Sec. 3. The section
will commence, in Subsec. 3.1, with a discussion of the generalized
likelihood-ratio principle and the generalized likelihood-ratio test. This principle plays a
central role in testing, just as maximum likelihood played a central role in
estimation. It is a technique for arriving at a test that in general will be a good
test, just as maximum likelihood led to an estimator that in general was quite a
good estimator. For a book of the level of this book, it is probably the most
important concept in testing. The notion of uniformly most powerful tests
will be introduced in Subsec. 3.2, and several methods that are sometimes
useful in finding such tests will be presented. Unbiasedness and invariance in
estimation are two methods of restricting the class of estimators with the hope
of finding a best estimator within the restricted class. These two concepts play
essentially the same role in testing; they are methods of restricting the totality of
possible tests with the hope of finding a best test within the restricted class. We
will discuss only unbiasedness, and it only briefly, in Subsec. 3.3. Subsection
3.4 will summarize several methods of finding tests of composite hypotheses.

Section 4 will be devoted to consideration of various hypotheses and 
tests that arise in sampling from a normal distribution. Section 5 will consider 
tests that fall within a category of tests generally labeled chi-square tests. 
Included will be the asymptotic distribution of the generalized likelihood-ratio, 
goodness-of-fit tests, tests of the equality of two or more distributions, and tests 
of independence in contingency tables. Section 6 will give the promised dis- 
cussion of the connection between tests of hypotheses and interval estimation. 
The chapter will end with an introduction to sequential tests of hypotheses in 
Sec. 7. 

The reader will note that our discussion of tests of hypotheses is not as 
thorough as that of estimation. Both testing and estimation will be used in 
later chapters, especially in Chap. X. Also, a number of the nonparametric 
techniques that will be presented in Chap. XI will be tests of hypotheses. 

We stated at the beginning of this section that testing of hypotheses is one 
major area of statistical inference. A type of statistical inference that is closely 
related (in fact so closely related that many writers do not make a distinction) to 
hypothesis testing is that of significance testing. The concept of significance 
testing has important use in applied problems; however, we will not consider it 
in this book. The interested reader is referred to Ref. [37]. 


2 SIMPLE HYPOTHESIS VERSUS 
SIMPLE ALTERNATIVE 

2.1 Introduction 

In this section we consider testing a simple null hypothesis against a simple 
alternative hypothesis. This case is actually not very useful in applied statistics, 
but it will serve the purpose of introducing us to the theory of testing hypotheses. 

We assume that we have a sample that came from one of two completely 
specified distributions. Our object is to determine which one. More precisely, 
assume that a random sample $X_1, \ldots, X_n$ came from the density $f_0(x)$ or $f_1(x)$,
and we want to test $\mathscr{H}_0$: $X_i$ distributed as $f_0(\cdot)$, abbreviated $X_i \sim f_0(\cdot)$, versus
$\mathscr{H}_1: X_i \sim f_1(\cdot)$. If we had only one observation $x_1$ and $f_0(\cdot)$ and $f_1(\cdot)$ were
as in Fig. 2, one might quite rationally decide that the observation came from
$f_0(\cdot)$ if $f_0(x_1) > f_1(x_1)$ and, conversely, decide that the observation came from
$f_1(\cdot)$ if $f_1(x_1) > f_0(x_1)$. This simple intuitive method of obtaining a test can be
expanded into a family of tests that, as we shall see, will contain some good tests.

FIGURE 2


Definition 8 Simple likelihood-ratio test Let $X_1, \ldots, X_n$ be a random
sample from either $f_0(\cdot)$ or $f_1(\cdot)$. A test $\Upsilon$ of $\mathscr{H}_0: X_i \sim f_0(\cdot)$ versus
$\mathscr{H}_1: X_i \sim f_1(\cdot)$ is defined to be a simple likelihood-ratio test if $\Upsilon$ is defined
by

Reject $\mathscr{H}_0$ if $\lambda < k$.

Accept $\mathscr{H}_0$ if $\lambda > k$.

Either accept $\mathscr{H}_0$, reject $\mathscr{H}_0$, or randomize if $\lambda = k$, (1)

where

$$\lambda = \lambda(x_1, \ldots, x_n) = \frac{\prod_{i=1}^{n} f_0(x_i)}{\prod_{i=1}^{n} f_1(x_i)}
= \frac{L_0(x_1, \ldots, x_n)}{L_1(x_1, \ldots, x_n)} = \frac{L_0}{L_1}$$

and $k$ is a nonnegative constant. [$L_j = L_j(x_1, \ldots, x_n)$ is the likelihood
function for sampling from the density $f_j(\cdot)$.] ////

For each different $k$ we have a different test. For a fixed $k$ the test says to
reject $\mathscr{H}_0$ if the ratio of likelihoods is small; that is, reject $\mathscr{H}_0$ if it is more likely
($L_1$ is large compared to $L_0$) that the sample came from $f_1(\cdot)$ than from $f_0(\cdot)$.
Such a test certainly has intuitive appeal. In fact, one might suspect that an
optimum test will have to be of the form of a simple likelihood-ratio test.

Optimality of a test of a simple hypothesis versus a simple alternative can
be approached in two ways. One way, using the power of the test to set goodness
criteria, is discussed in Subsec. 2.2, and the other way, using a loss function and a
decision-theoretical approach, is considered in Subsec. 2.3.
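
As an illustration of the definition of a simple likelihood-ratio test, the following sketch computes $\lambda = L_0/L_1$ for an observed sample and applies the rule with a given $k$. The two normal densities (means 17 and 20, variance 25), the sample values, and the function names are assumptions made only for the illustration; the likelihoods are accumulated on the log scale to avoid underflow.

```python
from math import exp, log, pi

def normal_logpdf(x, mean, var):
    """Logarithm of the normal density with the given mean and variance."""
    return -0.5 * log(2.0 * pi * var) - (x - mean) ** 2 / (2.0 * var)

def likelihood_ratio(sample, logpdf0, logpdf1):
    """lambda = L0 / L1 for the observed sample."""
    log_l0 = sum(logpdf0(x) for x in sample)
    log_l1 = sum(logpdf1(x) for x in sample)
    return exp(log_l0 - log_l1)

def simple_lr_test(sample, logpdf0, logpdf1, k):
    """Reject H0 if lambda < k, accept if lambda > k; the case
    lambda = k is left to the user (accept, reject, or randomize)."""
    lam = likelihood_ratio(sample, logpdf0, logpdf1)
    if lam < k:
        return "reject"
    if lam > k:
        return "accept"
    return "boundary"

# Hypothetical densities: f0 is normal(17, 25), f1 is normal(20, 25).
f0 = lambda x: normal_logpdf(x, 17.0, 25.0)
f1 = lambda x: normal_logpdf(x, 20.0, 25.0)

sample_near_f0 = [15.0, 16.5, 18.0, 17.2]
sample_near_f1 = [21.0, 19.5, 22.3, 20.1]
print(simple_lr_test(sample_near_f0, f0, f1, k=1.0))
print(simple_lr_test(sample_near_f1, f0, f1, k=1.0))
```

Changing $k$ changes the test: a smaller $k$ demands stronger evidence in favor of $f_1(\cdot)$ before $\mathscr{H}_0$ is rejected.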


2.2 Most Powerful Test

Let $X_1, \ldots, X_n$ be a random sample from the density $f_0(\cdot)$ or the density $f_1(\cdot)$.
Let us write $f_0(x) = f(x; \theta_0)$ and $f_1(x) = f(x; \theta_1)$; then $X_1, \ldots, X_n$ is a random
sample from one or the other member of the parametric family $\{f(x; \theta) : \theta = \theta_0$
or $\theta = \theta_1\}$. $\bar{\Theta} = \{\theta_0, \theta_1\}$ is a parameter space with only two points in it. $\theta_0$ and
$\theta_1$ are known. We want to test $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$. Corresponding
to any test $\Upsilon$ of $\mathscr{H}_0$ versus $\mathscr{H}_1$ is its power function $\pi_\Upsilon(\theta)$. A good test is a test
for which $\pi_\Upsilon(\theta_0) = P[\text{reject } \mathscr{H}_0 \mid \mathscr{H}_0 \text{ is true}]$ is small (ideally 0) and $\pi_\Upsilon(\theta_1) =
P[\text{reject } \mathscr{H}_0 \mid \mathscr{H}_0 \text{ is false}]$ is large (ideally unity). One might reasonably use the
two values $\pi_\Upsilon(\theta_0)$ and $\pi_\Upsilon(\theta_1)$ to set up criteria for defining a best test. $\pi_\Upsilon(\theta_0) =$
size of Type I error, and $1 - \pi_\Upsilon(\theta_1) = P[\text{accept } \mathscr{H}_0 \mid \mathscr{H}_0 \text{ is false}] =$ size of Type II
error; so our goodness criterion might concern making the two error sizes small.
For example, one might define as best that test which has the smallest sum of the
error sizes. Another method of defining a best test, made precise in the following
definition, is to fix the size of the Type I error and to minimize the size of the
Type II error.

Definition 9 Most powerful test A test $\Upsilon^*$ of $\mathscr{H}_0: \theta = \theta_0$ versus
$\mathscr{H}_1: \theta = \theta_1$ is defined to be a most powerful test of size $\alpha$ ($0 < \alpha < 1$) if and
only if:

(i) $\pi_{\Upsilon^*}(\theta_0) = \alpha$. (2)

(ii) $\pi_{\Upsilon^*}(\theta_1) \ge \pi_\Upsilon(\theta_1)$ for any other test $\Upsilon$ for which $\pi_\Upsilon(\theta_0) \le \alpha$. (3)

////


A test $\Upsilon^*$ is most powerful of size $\alpha$ if it has size $\alpha$ and if among all other
tests of size $\alpha$ or less it has the largest power. Or a test $\Upsilon^*$ is most powerful of
size $\alpha$ if it has the size of its Type I error equal to $\alpha$ and has smallest size Type II
error among all other tests with size of Type I error $\alpha$ or less.

The justification for fixing the size of the Type I error to be a (usually 
small and often taken as .05 or .01) seems to arise from those testing situations 
where the two hypotheses are formulated in such a way that one type of error is 
more serious than the other. The hypotheses are stated so that the Type I error 
is the more serious, and hence one wants to be certain that it is small. 

The following theorem is useful in finding a most powerful test of size a. 
The statement of the theorem as given here, as well as the proof, considers only 
nonrandomized tests. We might note that the statement and proof of the 
theorem can be altered to include all randomized tests. 

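Theorem 2 Neyman-Pearson lemma Let $X_1, \ldots, X_n$ be a random
sample from $f(x; \theta)$, where $\theta$ is one of the two known values $\theta_0$ or $\theta_1$,
and let $0 < \alpha < 1$ be fixed.

Let $k^*$ be a positive constant and $C^*$ be a subset of $\mathfrak{X}$ which satisfy:

(i) $P_{\theta_0}[(X_1, \ldots, X_n) \in C^*] = \alpha$. (4)

(ii) $\lambda = \dfrac{L(\theta_0; x_1, \ldots, x_n)}{L(\theta_1; x_1, \ldots, x_n)} = \dfrac{L_0}{L_1} \le k^*$ if $(x_1, \ldots, x_n) \in C^*$

and $\lambda \ge k^*$ if $(x_1, \ldots, x_n) \in \bar{C}^*$. (5)

Then the test $\Upsilon^*$ corresponding to the critical region $C^*$ is a most powerful
test of size $\alpha$ of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$. [Recall that $L_j =
L(\theta_j; x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i; \theta_j)$ for $j = 0$ or 1 and $\bar{C}^*$ is the complement of
$C^*$; that is, $\bar{C}^* = \mathfrak{X} - C^*$.]

PROOF Suppose that $k^*$ and $C^*$ satisfying conditions (i) and (ii)
exist. If there is no other test of size $\alpha$ or less, then $\Upsilon^*$ is automatically
most powerful. Let $\Upsilon$ be another test of size $\alpha$ or less, and let $C$ be its
corresponding critical region. We have $P_{\theta_0}[(X_1, \ldots, X_n) \in C] \le \alpha$. We
must show that $\pi_{\Upsilon^*}(\theta_1) \ge \pi_\Upsilon(\theta_1)$ to complete the proof. [For any subset
$R$ of $\mathfrak{X}$, let us abbreviate $\int \cdots \int_R \bigl[\prod_{i=1}^{n} f(x_i; \theta_j)\, dx_i\bigr]$ as $\int_R L_j$ for $j = 0, 1$.
Our notation indicates that $f_0(\cdot)$ and $f_1(\cdot)$ are probability density
functions. The same proof holds for discrete density functions.] Showing
that $\pi_{\Upsilon^*}(\theta_1) \ge \pi_\Upsilon(\theta_1)$ is equivalent to showing that $\int_{C^*} L_1 \ge \int_C L_1$. See
Fig. 3.

Now
$$\int_{C^*} L_1 - \int_C L_1 = \int_{C^*\bar{C}} L_1 - \int_{C\bar{C}^*} L_1
\ge \frac{1}{k^*} \int_{C^*\bar{C}} L_0 - \frac{1}{k^*} \int_{C\bar{C}^*} L_0,$$
since $L_1 \ge L_0/k^*$ on $C^*$ (hence also on $C^*\bar{C}$) and $L_1 \le L_0/k^*$, or $-L_1 \ge
-L_0/k^*$, on $\bar{C}^*$ (hence also on $C\bar{C}^*$). But
$$\frac{1}{k^*}\left(\int_{C^*\bar{C}} L_0 - \int_{C\bar{C}^*} L_0\right)
= \frac{1}{k^*}\left(\int_{C^*\bar{C}} L_0 + \int_{C^*C} L_0 - \int_{C^*C} L_0 - \int_{C\bar{C}^*} L_0\right)
= \frac{1}{k^*}\left(\int_{C^*} L_0 - \int_C L_0\right)$$
$$= \frac{1}{k^*}(\alpha - \text{size of test } \Upsilon) \ge 0;$$
so $\int_{C^*} L_1 - \int_C L_1 \ge 0$, as was to be shown. ////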

We comment that $k^*$ and $C^*$ satisfying conditions (i) and (ii) do not always
exist, and then the theorem, as stated, would not give a most powerful size $\alpha$
test. However, whenever $f_0(\cdot)$ and $f_1(\cdot)$ are probability density functions, a $k^*$
and $C^*$ will exist. Although the theorem does not explicitly say how to find $k^*$
and $C^*$, implicitly it does since the form of the test, that is, the critical region, is
given by Eq. (5). In practice, even though $k^*$ and $C^*$ do exist, often it is not
necessary to find them. Instead the inequality $\lambda \le k^*$ for $(x_1, \ldots, x_n) \in C^*$ is
manipulated into an equivalent inequality that is easier to work with, and the
actual test is then expressed in terms of the new inequality. The following
examples should help clarify the above.

FIGURE 3


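EXAMPLE 7 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \theta e^{-\theta x} I_{(0,\infty)}(x)$,
where $\theta = \theta_0$ or $\theta = \theta_1$. $\theta_0$ and $\theta_1$ are known fixed numbers, and for
concreteness we assume that $\theta_1 > \theta_0$. We want to test $\mathscr{H}_0: \theta = \theta_0$ versus
$\mathscr{H}_1: \theta = \theta_1$. Now $L_0 = \theta_0^n \exp(-\theta_0 \sum x_i)$, $L_1 = \theta_1^n \exp(-\theta_1 \sum x_i)$, and
according to the Neyman-Pearson lemma the most powerful test will have
the form: Reject $\mathscr{H}_0$ if $\lambda \le k^*$ or if $(\theta_0/\theta_1)^n \exp[-(\theta_0 - \theta_1) \sum x_i] \le k^*$,
which is equivalent to

$$\sum x_i \le \frac{1}{\theta_1 - \theta_0} \log_e\!\left[\left(\frac{\theta_1}{\theta_0}\right)^{\!n} k^*\right] = k',$$

where $k'$ is just a constant. The inequality $\lambda \le k^*$ has been simplified and
expressed as the equivalent inequality $\sum x_i \le k'$. Condition (i) is $\alpha =
P_{\theta_0}[\text{reject } \mathscr{H}_0] = P_{\theta_0}[\sum X_i \le k']$. We know that $\sum X_i$ has a gamma
distribution with parameters $n$ and $\theta$; hence

$$P_{\theta_0}\Bigl[\sum X_i \le k'\Bigr] = \int_0^{k'} \frac{1}{\Gamma(n)}\, \theta_0^n x^{n-1} e^{-\theta_0 x}\, dx = \alpha,$$

an equation in $k'$, from which $k'$ can be determined; and the most powerful
test of size $\alpha$ of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$, $\theta_1 > \theta_0$, is this: Reject $\mathscr{H}_0$
if $\sum x_i \le k'$, where $k'$ is the $\alpha$th quantile point of the gamma distribution
with parameters $n$ and $\theta_0$. ////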
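
The equation for $k'$ in Example 7 has no closed-form solution in general, but it is easily solved numerically. The sketch below is one way to do so, under assumptions of our own choosing ($n = 10$, $\theta_0 = 1$, $\theta_1 = 2$, $\alpha = .05$, none of which come from the example). It uses the fact that for integer $n$ the gamma distribution function is $1 - e^{-\theta k}\sum_{j=0}^{n-1} (\theta k)^j/j!$, and finds the quantile by bisection; the function names are ours.

```python
from math import exp

def erlang_cdf(k, n, theta):
    """P[X_1 + ... + X_n <= k] for independent exponential(theta)
    variables, that is, the gamma(n, theta) cdf for integer n."""
    if k <= 0.0:
        return 0.0
    term = total = 1.0
    for j in range(1, n):
        term *= theta * k / j      # (theta*k)^j / j!
        total += term
    return 1.0 - exp(-theta * k) * total

def critical_value(alpha, n, theta0, tol=1e-10):
    """Solve erlang_cdf(k, n, theta0) = alpha for k by bisection."""
    lo, hi = 0.0, 1.0
    while erlang_cdf(hi, n, theta0) < alpha:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if erlang_cdf(mid, n, theta0) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical values chosen for illustration only.
n, theta0, theta1, alpha = 10, 1.0, 2.0, 0.05
k_prime = critical_value(alpha, n, theta0)
size = erlang_cdf(k_prime, n, theta0)           # P_theta0[sum <= k']
power_at_theta1 = erlang_cdf(k_prime, n, theta1)
print(k_prime, size, power_at_theta1)
```

The test rejects $\mathscr{H}_0$ when the observed $\sum x_i$ does not exceed `k_prime`; the last quantity printed is the power of the test at the alternative.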

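EXAMPLE 8 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) =
\theta^x (1 - \theta)^{1-x} I_{\{0,1\}}(x)$, where $\theta = \theta_0$ or $\theta = \theta_1$. We want to test $\mathscr{H}_0: \theta = \theta_0$
versus $\mathscr{H}_1: \theta = \theta_1$, where, say, $\theta_0 < \theta_1$. $L_0 = \theta_0^{\sum x_i}(1 - \theta_0)^{n - \sum x_i}$, $L_1 =
\theta_1^{\sum x_i}(1 - \theta_1)^{n - \sum x_i}$, and so $\lambda \le k^*$ if and only if

$$\frac{\theta_0^{\sum x_i}(1 - \theta_0)^{n - \sum x_i}}{\theta_1^{\sum x_i}(1 - \theta_1)^{n - \sum x_i}} \le k^*,$$

if and only if

$$\left[\frac{\theta_0 (1 - \theta_1)}{(1 - \theta_0)\, \theta_1}\right]^{\sum x_i} \le k^* \left(\frac{1 - \theta_1}{1 - \theta_0}\right)^{\!n},$$

or if and only if $\sum x_i \ge k'$, where $k'$ is a constant. {Note that
$\log_e [\theta_0 (1 - \theta_1)/(1 - \theta_0)\theta_1] < 0$.} So a most powerful test would be of the
form: Reject $\mathscr{H}_0$ if $\sum x_i$ is large. For definiteness, let us take $\theta_0 = \frac{1}{4}$,
$\theta_1 = \frac{3}{4}$, and $n = 10$. We must find $k'$ so that

$$\alpha = P_{\theta_0}[\text{reject } \mathscr{H}_0] = P_{\theta = 1/4}\Bigl[\sum X_i \ge k'\Bigr]
= \sum_{j=k'}^{10} \binom{10}{j} \left(\frac{1}{4}\right)^{\!j} \left(\frac{3}{4}\right)^{\!10-j}.$$

If $\alpha = .0197$, then $k' = 6$, and if $\alpha = .0781$, then $k' = 5$. For $\alpha = .05$, there
is no critical region $C^*$ and constant $k^*$ of the form given in the
Neyman-Pearson lemma. In this example our random variables are discrete, and
for discrete random variables it is not possible to find a $k^*$ and $C^*$ satisfying
conditions (i) and (ii) for an arbitrary fixed $0 < \alpha < 1$. In practice one is
usually content to change the size of the test to some $\alpha$ for which a test
given by the Neyman-Pearson lemma can be found. We might note,
however, that a most powerful test of size $\alpha$ does exist. The test would be
a randomized test. For the example at hand, if we take $\alpha = .05$, the
randomized test with critical function

$$\psi(x_1, \ldots, x_{10}) =
\begin{cases}
1 & \text{if } \sum x_i \ge 6 \\[4pt]
\dfrac{.05 - .0197}{.0584} & \text{if } \sum x_i = 5 \\[8pt]
0 & \text{if } \sum x_i \le 4
\end{cases}$$

is the most powerful test of size $\alpha = .05$. ////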
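
The numbers quoted in Example 8 can be reproduced from the binomial distribution with $n = 10$ and $\theta_0 = \frac{1}{4}$. The sketch below computes the two upper-tail probabilities and the randomization probability at $\sum x_i = 5$ that brings the size up to exactly .05; the function and variable names are ours.

```python
from math import comb

def binom_pmf(j, n, p):
    """P[sum of n Bernoulli(p) variables equals j]."""
    return comb(n, j) * p**j * (1.0 - p) ** (n - j)

def upper_tail(k, n, p):
    """P[sum of n Bernoulli(p) variables is at least k]."""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))

n, theta0, alpha = 10, 0.25, 0.05

tail6 = upper_tail(6, n, theta0)   # size of the test "reject if sum >= 6"
tail5 = upper_tail(5, n, theta0)   # size of the test "reject if sum >= 5"

# Probability of rejecting when the sum equals 5, chosen so that the
# randomized test has size exactly alpha.
reject_prob_at_5 = (alpha - tail6) / binom_pmf(5, n, theta0)
size = tail6 + reject_prob_at_5 * binom_pmf(5, n, theta0)

print(round(tail6, 4), round(tail5, 4), round(reject_prob_at_5, 4), size)
```

Since .05 lies strictly between the two attainable nonrandomized sizes, the randomization probability lies strictly between 0 and 1.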


In closing this subsection, we note that a most powerful test of size $\alpha$,
given by the Neyman-Pearson lemma, is necessarily a simple likelihood-ratio
test.


2.3 Loss Function

As in the last subsection, we assume that we have a random sample $X_1, \ldots, X_n$
from one or the other of the two completely known densities $f_0(\cdot) = f(\cdot\,; \theta_0)$ and
$f_1(\cdot) = f(\cdot\,; \theta_1)$. On the basis of the observed sample we have to decide from
which of the two densities the sample came; that is, we test $\mathscr{H}_0: \theta = \theta_0$ versus
$\mathscr{H}_1: \theta = \theta_1$. We can make one of two decisions, say $d_0$ or $d_1$, where $d_j$ is the
decision that $f_j(\cdot)$ is the density from which the sample came, $j = 0, 1$. We
assume that a loss function is available.

Definition 10 Loss function In testing $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$,
define $\ell(d_i; \theta_j)$ = loss incurred when decision $d_i$ is made and $\theta_j$ is the true
parameter value for $i = 0, 1$ and $j = 0, 1$, where $d_i$ is the decision of deciding
that hypothesis $\mathscr{H}_i$ is correct. We will adopt the convention that
$\ell(d_i; \theta_i) = 0$ for $i = 0, 1$ and $\ell(d_i; \theta_j) > 0$ for $i \ne j$. ////

The values of the loss function are the amounts that are lost if decision $d_i$
is made when $\theta_j$ was correct. With our convention, nothing is lost if the right
decision is made, and a positive amount is lost if a wrong decision is made. If
we think in terms of a (nonrandomized) test $\Upsilon$ having a critical region $C_\Upsilon$, then
decision $d_1$ is made if the observed sample $(x_1, \ldots, x_n)$ belongs to $C_\Upsilon$, and
decision $d_0$ is made if $(x_1, \ldots, x_n)$ belongs to $\bar{C}_\Upsilon$. A test can be thought of as a
decision function since for a given observed sample the test tells us which decision
to make. We do not consider the problem of selecting an appropriate loss
function, and hence we will always assume that an appropriate loss function has
been prescribed. Note that if decision $d_1$ is made when $\theta_0$ is correct, then a
Type I error is made, and if decision $d_0$ is made when $\theta_1$ is correct, then a Type II
error is made.

In comparing two tests we naturally prefer that test which has smaller loss,
and among all tests we would prefer that test which has smallest loss. However,
seldom will there exist one test that has smallest loss for both possible decisions
and for both $\theta_0$ and $\theta_1$. This motivates the defining of average loss; and by
continuing to borrow language from decision theory we define the risk function
of a test.

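Definition 11 Risk function For a random sample $X_1, \ldots, X_n$ from
$f(\cdot\,; \theta_0)$ or $f(\cdot\,; \theta_1)$, let $\Upsilon$ be a test of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$ having
a critical region $C_\Upsilon$. For a given loss function $\ell(\cdot\,; \cdot)$, the risk function
of $\Upsilon$, denoted by $\mathscr{R}_\Upsilon(\theta)$, is defined to be the expected loss; that is,

$$\mathscr{R}_\Upsilon(\theta) = \int \cdots \int_{C_\Upsilon} \ell(d_1; \theta) \Bigl[\prod_{i=1}^{n} f(x_i; \theta)\, dx_i\Bigr]
+ \int \cdots \int_{\bar{C}_\Upsilon} \ell(d_0; \theta) \Bigl[\prod_{i=1}^{n} f(x_i; \theta)\, dx_i\Bigr].$$ ////

Remark

$$\mathscr{R}_\Upsilon(\theta) = \ell(d_1; \theta) P_\theta[(X_1, \ldots, X_n) \in C_\Upsilon] + \ell(d_0; \theta) P_\theta[(X_1, \ldots, X_n) \in \bar{C}_\Upsilon]$$

$$= \ell(d_1; \theta)\, \pi_\Upsilon(\theta) + \ell(d_0; \theta)[1 - \pi_\Upsilon(\theta)]; \qquad (6)$$

that is, the risk function is a linear function of the power function; the
coefficients in the linear function are determined by the values of the loss
function. Since $\theta$ assumes only two values, $\mathscr{R}_\Upsilon(\theta)$ can take on only two
values, which are

$$\mathscr{R}_\Upsilon(\theta_0) = \ell(d_1; \theta_0)\, \pi_\Upsilon(\theta_0) \quad \text{and} \quad
\mathscr{R}_\Upsilon(\theta_1) = \ell(d_0; \theta_1)[1 - \pi_\Upsilon(\theta_1)]. \qquad (7)$$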
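Our object is to select that test which has smallest risk, but, unfortunately,
such a test will seldom exist. The difficulty is that the risk takes on two values,
and a test that minimizes both of these values simultaneously over all possible
tests does not exist except in rare situations. (See Prob. 3.) Not being able
to find a test with smallest risk, we resort to another, less desirable criterion, that
of minimizing the largest value of the risk function.

Definition 12 Minimax test A test $\Upsilon_m$ of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$
is defined to be minimax if and only if

$$\max\,[\mathscr{R}_{\Upsilon_m}(\theta_0), \mathscr{R}_{\Upsilon_m}(\theta_1)] \le \max\,[\mathscr{R}_\Upsilon(\theta_0), \mathscr{R}_\Upsilon(\theta_1)]$$

for any other test $\Upsilon$. ////

The following theorem is sometimes useful in finding a minimax test. (As
with Theorem 2, we state the theorem and proof in terms of nonrandomized
tests.)

Theorem 3 For a random sample $X_1, \ldots, X_n$ from $f(\cdot\,; \theta_0)$ or $f(\cdot\,; \theta_1)$,
consider testing $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$. If a test $\Upsilon_m$ has a critical
region given by $C_m = \{(x_1, \ldots, x_n) : \lambda \le k_m\}$, where $k_m$ is a positive constant
such that $\mathscr{R}_{\Upsilon_m}(\theta_0) = \mathscr{R}_{\Upsilon_m}(\theta_1)$, then $\Upsilon_m$ is minimax. Recall that
$\lambda = L_0/L_1$.

PROOF We will assume that $f(\cdot\,; \theta_0)$ and $f(\cdot\,; \theta_1)$ are probability
density functions. The proof for discrete density functions is similar.
Let $\Upsilon$ be any other test with a critical region $C_\Upsilon$ which satisfies
$\mathscr{R}_\Upsilon(\theta_0) \le \mathscr{R}_{\Upsilon_m}(\theta_0)$. [Note that if $\mathscr{R}_\Upsilon(\theta_0) > \mathscr{R}_{\Upsilon_m}(\theta_0)$, then $\Upsilon$ would not
even be a candidate for minimax.] We have $\mathscr{R}_\Upsilon(\theta_0) \le \mathscr{R}_{\Upsilon_m}(\theta_0)$, or
$\ell(d_1; \theta_0)\pi_\Upsilon(\theta_0) \le \ell(d_1; \theta_0)\pi_{\Upsilon_m}(\theta_0)$, or $\pi_\Upsilon(\theta_0) \le \pi_{\Upsilon_m}(\theta_0)$; that is, $\Upsilon$ has size
less than or equal to that of $\Upsilon_m$. But, by the Neyman-Pearson lemma,
$\Upsilon_m$ is the most powerful test of size $\pi_{\Upsilon_m}(\theta_0)$; hence $\pi_{\Upsilon_m}(\theta_1) \ge \pi_\Upsilon(\theta_1)$, or
$1 - \pi_\Upsilon(\theta_1) \ge 1 - \pi_{\Upsilon_m}(\theta_1)$, or
$\ell(d_0; \theta_1)[1 - \pi_\Upsilon(\theta_1)] \ge \ell(d_0; \theta_1)[1 - \pi_{\Upsilon_m}(\theta_1)]$, or
$\mathscr{R}_\Upsilon(\theta_1) \ge \mathscr{R}_{\Upsilon_m}(\theta_1)$. So, we have
$\max\,[\mathscr{R}_{\Upsilon_m}(\theta_0), \mathscr{R}_{\Upsilon_m}(\theta_1)] = \mathscr{R}_{\Upsilon_m}(\theta_1) \le
\mathscr{R}_\Upsilon(\theta_1) \le \max\,[\mathscr{R}_\Upsilon(\theta_0), \mathscr{R}_\Upsilon(\theta_1)]$; that is, $\Upsilon_m$ is minimax. ////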

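EXAMPLE 9 Let $X_1, \ldots, X_n$ be a random sample from $f(x; \theta) = \theta e^{-\theta x} I_{(0,\infty)}(x)$.
For $\theta_1 > \theta_0$, test $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$. In Example 7, we found
the most powerful size $\alpha$ test. We seek now to find the minimax test for
a loss function given by $\ell(d_1; \theta_0) = a$ and $\ell(d_0; \theta_1) = b$. According to
Theorem 3, the minimax test $\Upsilon_m$ is given by $C_m = \{(x_1, \ldots, x_n) : \lambda \le k_m\}$,
where $k_m$ is such that $\mathscr{R}_{\Upsilon_m}(\theta_0) = \mathscr{R}_{\Upsilon_m}(\theta_1)$. $\{\lambda \le k_m\}$ can be rewritten as
$\{\sum x_i \le k\}$ for some constant $k$, and $\mathscr{R}_{\Upsilon_m}(\theta_0) = \mathscr{R}_{\Upsilon_m}(\theta_1)$ if and only if
$a\,\pi_{\Upsilon_m}(\theta_0) = b[1 - \pi_{\Upsilon_m}(\theta_1)]$; so we seek $k$ such that $a P_{\theta_0}[\sum X_i \le k] =
b P_{\theta_1}[\sum X_i > k]$. $k$ is given as the solution to

$$a \int_0^{k} \frac{1}{\Gamma(n)}\, \theta_0^n x^{n-1} e^{-\theta_0 x}\, dx
= b \int_k^{\infty} \frac{1}{\Gamma(n)}\, \theta_1^n x^{n-1} e^{-\theta_1 x}\, dx.$$ ////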
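
The equation defining the minimax constant $k$ in Example 9 can be solved numerically in the same way as the quantile equation of Example 7. The following sketch assumes illustrative values ($n = 10$, $\theta_0 = 1$, $\theta_1 = 2$, $a = b = 1$) that are not taken from the example. It relies on the fact that $a P_{\theta_0}[\sum X_i \le k] - b P_{\theta_1}[\sum X_i > k]$ increases in $k$ from $-b$ to $a$, so bisection applies; the function names are ours.

```python
from math import exp

def erlang_cdf(k, n, theta):
    """Gamma(n, theta) cdf for integer n: P[X_1 + ... + X_n <= k]."""
    if k <= 0.0:
        return 0.0
    term = total = 1.0
    for j in range(1, n):
        term *= theta * k / j      # (theta*k)^j / j!
        total += term
    return 1.0 - exp(-theta * k) * total

def minimax_cutoff(n, theta0, theta1, a, b, tol=1e-10):
    """Find k with a*P_theta0[sum <= k] = b*P_theta1[sum > k]."""
    def gap(k):
        # Risk at theta0 minus risk at theta1; increasing in k.
        return a * erlang_cdf(k, n, theta0) - b * (1.0 - erlang_cdf(k, n, theta1))
    lo, hi = 0.0, 1.0
    while gap(hi) < 0.0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical values chosen for illustration only.
n, theta0, theta1, a, b = 10, 1.0, 2.0, 1.0, 1.0
k = minimax_cutoff(n, theta0, theta1, a, b)
risk0 = a * erlang_cdf(k, n, theta0)            # risk when theta = theta0
risk1 = b * (1.0 - erlang_cdf(k, n, theta1))    # risk when theta = theta1
print(k, risk0, risk1)
```

At the solution the two risks agree, which is the condition required by Theorem 3.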

Before leaving minimax tests we make two comments: First, if $f_0(\cdot)$ and
$f_1(\cdot)$ are discrete density functions, then there may not exist a $k_m$ such that
$\mathscr{R}_{\Upsilon_m}(\theta_0) = \mathscr{R}_{\Upsilon_m}(\theta_1)$ unless randomized tests are allowed; and, second, a minimax
test as given in Theorem 3 was a simple likelihood-ratio test.

In the above we assumed that $X_1, \ldots, X_n$ was a random sample from
$f(\cdot\,; \theta)$, where $\theta = \theta_0$ or $\theta_1$, and for each $\theta$, $f(\cdot\,; \theta)$ is completely known. We
also assumed that we had an appropriate loss function. Now, we further
assume that $\theta_0$ and $\theta_1$ are the possible values of a random variable $\Theta$ and that
we know the distribution of $\Theta$, which is called the prior distribution, just as in
our considerations of Bayes estimation. $\Theta$ is discrete, taking on only two
values $\theta_0$ and $\theta_1$; so the prior distribution of $\Theta$ is completely given by, say, $g$,
where $g = P[\Theta = \theta_1] = 1 - P[\Theta = \theta_0]$. We mentioned above that, in general,
a test with smallest risk function for both arguments does not exist. Now that
we have a prior distribution for the two arguments of the risk function, we can
define an average risk and seek that test with smallest average risk.


Definition 13 Bayes test A test of : ® *= versus x : 0 — 0 X is 
defined to be a Bayes test with respect to the prior distribution given by 
g = P[© = 0J if and only if 

(1 - gWr,(0 0 ) + g8 r,(0i) ^ (1 - d)^r(0 o ) + ( 8 ) 

for any other test T. Ml 

To find a Bayes test we seek a critical region C g that minimizes 

(1 - g)0t T (0 O ) + g@r(0i) = (1 - ffWi i °o)^riOo) + ffWo > °i)l l ~ 71 r(^)] 
= (1 — gW(di j 9 0 ) Jc L 0 + gtfa ; $i) jc 

as a function of the region C. Now 

(1 g)£{di i 0 o ) Jc Lq + gC{do ; 0]) Jc 

— gd(d 0 ; 0 X ) + Jc[(l -gy(d L ; 6 0 )L 0 -gt(d 0 ; OJL^, 


( 9 ) 
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which is minimized if C is defined to be all (*„ , x„) for which the last inte- 

grand of Eq (9) is negative, that is 

C, = {(*,. . x.) (1 - gy(J t , 0 o )L o - g«d Q , 0 t )L , < 0} (10) 

We have proved the following theorem 


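Theorem 4 The Bayes test $\Upsilon_g$ of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$ with respect to a prior distribution given by $g = P[\Theta = \theta_1]$ has a critical region defined by

$$C_g = \left\{(x_1, \dots, x_n): \lambda < \frac{g\,\ell(d_0; \theta_1)}{(1 - g)\,\ell(d_1; \theta_0)}\right\}, \tag{11}$$

where $\lambda = L_0/L_1$ is the simple likelihood-ratio. ////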


We note that once again a good test, in this case a Bayes test, turns out to be a simple likelihood-ratio test. The exact form of the Bayes test is given by Eq. (11).


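EXAMPLE 10 Let $X_1, \dots, X_n$ be a random sample from $f(x; \theta) = \theta e^{-\theta x} I_{(0, \infty)}(x)$. Test $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$. The critical region of a Bayes test is given by

$$C_g = \left\{(x_1, \dots, x_n): \lambda = \frac{\theta_0^n \exp(-\theta_0 \sum x_i)}{\theta_1^n \exp(-\theta_1 \sum x_i)} < \frac{g\,\ell(d_0; \theta_1)}{(1 - g)\,\ell(d_1; \theta_0)}\right\}$$
$$= \left\{(x_1, \dots, x_n): \sum x_i < \frac{1}{\theta_1 - \theta_0} \log_e\left[\frac{g\,\theta_1^n\,\ell(d_0; \theta_1)}{(1 - g)\,\theta_0^n\,\ell(d_1; \theta_0)}\right]\right\}$$

for $\theta_1 > \theta_0$. ////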
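The critical region of Example 10 can be checked numerically. The following is a minimal sketch, assuming $\theta_1 > \theta_0 > 0$; the function names `bayes_cutoff` and `likelihood_ratio` and the numerical inputs are ours, chosen only for illustration.

```python
import math

def bayes_cutoff(theta0, theta1, n, g, loss_d0_theta1, loss_d1_theta0):
    """Cutoff c such that the Bayes test rejects H0 when sum(x) < c.

    Implements the second form of the critical region in Example 10;
    that form assumes theta1 > theta0 > 0.
    """
    if not theta1 > theta0 > 0:
        raise ValueError("this form of the region assumes theta1 > theta0 > 0")
    ratio = (g * theta1**n * loss_d0_theta1) / ((1 - g) * theta0**n * loss_d1_theta0)
    return math.log(ratio) / (theta1 - theta0)

def likelihood_ratio(xs, theta0, theta1):
    """Simple likelihood-ratio L0/L1 for the exponential density."""
    n, s = len(xs), sum(xs)
    return (theta0 / theta1) ** n * math.exp(-(theta0 - theta1) * s)

# Hypothetical inputs: equal prior weights and equal losses.
theta0, theta1, g = 1.0, 2.0, 0.5
threshold = g * 1.0 / ((1 - g) * 1.0)      # right-hand side of Eq. (11)
for sample in ([0.2, 0.5, 0.1], [1.0, 1.5, 0.9]):
    cutoff = bayes_cutoff(theta0, theta1, len(sample), g, 1.0, 1.0)
    by_ratio = likelihood_ratio(sample, theta0, theta1) < threshold
    by_sum = sum(sample) < cutoff
    print(sample, by_ratio, by_sum)        # the two forms of the region agree
```

Both descriptions of the region give the same decision for every sample, which is all the algebra in the example asserts.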


3 COMPOSITE HYPOTHESES 

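In Sec. 2 we considered testing a simple hypothesis against a simple alternative. We return now to the more general hypotheses-testing problem, that of testing composite hypotheses. We assume that we have a random sample from $f(x; \theta)$, $\theta \in \overline{\Theta}$, and we want to test $\mathscr{H}_0: \theta \in \overline{\Theta}_0$ versus $\mathscr{H}_1: \theta \in \overline{\Theta}_1$, where $\overline{\Theta}_0 \subset \overline{\Theta}$, $\overline{\Theta}_1 \subset \overline{\Theta}$, and $\overline{\Theta}_0$ and $\overline{\Theta}_1$ are disjoint. Usually $\overline{\Theta}_1 = \overline{\Theta} - \overline{\Theta}_0$. We begin by discussing a general method of constructing a test.

3.1 Generalized Likelihood-ratio Test

For a random sample $X_1, \dots, X_n$ from a density $f(x; \theta)$, $\theta \in \overline{\Theta}$, we seek a test of $\mathscr{H}_0: \theta \in \overline{\Theta}_0$ versus $\mathscr{H}_1: \theta \in \overline{\Theta}_1$.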


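Definition 14 Generalized likelihood-ratio Let $L(\theta; x_1, \dots, x_n)$ be the likelihood function for a sample $X_1, \dots, X_n$ having joint density $f_{X_1, \dots, X_n}(x_1, \dots, x_n; \theta)$, where $\theta \in \overline{\Theta}$. The generalized likelihood-ratio, denoted by $\lambda$ or $\lambda_n$, is defined to be

$$\lambda = \lambda_n = \lambda(x_1, \dots, x_n) = \frac{\sup_{\theta \in \overline{\Theta}_0} L(\theta; x_1, \dots, x_n)}{\sup_{\theta \in \overline{\Theta}} L(\theta; x_1, \dots, x_n)}. \tag{12}$$

////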


Note that $\lambda$ is a function of $x_1, \dots, x_n$, namely $\lambda(x_1, \dots, x_n)$. When the observations are replaced by their corresponding random variables $X_1, \dots, X_n$, then we write $\Lambda$ for $\lambda$; that is, $\Lambda = \lambda(X_1, \dots, X_n)$. $\Lambda$ is a function of the random variables $X_1, \dots, X_n$ and is itself a random variable. In fact, $\Lambda$ is a statistic since it does not depend on unknown parameters.

Several further notes follow: (i) Although we used the same symbol $\lambda$ to denote the simple likelihood-ratio, the generalized likelihood-ratio does not reduce to the simple likelihood-ratio for $\overline{\Theta} = \{\theta_0, \theta_1\}$. (ii) $\lambda$ given by Eq. (12) necessarily satisfies $0 \le \lambda \le 1$; $\lambda \ge 0$ since we have a ratio of nonnegative quantities, and $\lambda \le 1$ since the supremum taken in the denominator is over a larger set of parameter values than that in the numerator; hence the denominator cannot be smaller than the numerator. (iii) The parameter $\theta$ can be vector-valued. (iv) The denominator of $\Lambda$ is the likelihood function evaluated at the maximum-likelihood estimator. (v) In our considerations of the generalized likelihood-ratio, often the sample $X_1, \dots, X_n$ will be a random sample from a density $f(x; \theta)$ where $\theta \in \overline{\Theta}$.

The values $\lambda$ of the statistic $\Lambda$ are used to formulate a test of $\mathscr{H}_0: \theta \in \overline{\Theta}_0$ versus $\mathscr{H}_1: \theta \in \overline{\Theta} - \overline{\Theta}_0$ by employing the generalized likelihood-ratio test principle, which states that $\mathscr{H}_0$ is to be rejected if and only if $\lambda \le \lambda_0$, where $\lambda_0$ is some fixed constant satisfying $0 \le \lambda_0 \le 1$. (The constant $\lambda_0$ is often specified by fixing the size of the test.) $\Lambda$ is the test statistic. The generalized likelihood-ratio test makes good intuitive sense since $\lambda$ will tend to be small when $\mathscr{H}_0$ is not true, since then the denominator of $\lambda$ tends to be larger than the numerator. In general, a generalized likelihood-ratio test will be a good test, although there are examples where the generalized likelihood-ratio test makes a poor showing compared to other tests. One possible drawback of the test is that it is sometimes difficult to find $\sup L(\theta; x_1, \dots, x_n)$; another is that it can be difficult to find the distribution of $\Lambda$, which is required to evaluate the power of the test.


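EXAMPLE 11 Let $X_1, \dots, X_n$ be a random sample from $f(x; \theta) = \theta e^{-\theta x} I_{(0, \infty)}(x)$, where $\overline{\Theta} = \{\theta: \theta > 0\}$. Test $\mathscr{H}_0: \theta \le \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$.

$$\sup_{\theta \in \overline{\Theta}} L(\theta; x_1, \dots, x_n) = \sup_{\theta > 0}\left[\theta^n \exp\left(-\theta \sum x_i\right)\right] = \left(\frac{n}{\sum x_i}\right)^n e^{-n},$$

and

$$\sup_{\theta \in \overline{\Theta}_0} L(\theta; x_1, \dots, x_n) = \sup_{0 < \theta \le \theta_0}\left[\theta^n \exp\left(-\theta \sum x_i\right)\right] = \begin{cases} \left(\dfrac{n}{\sum x_i}\right)^n e^{-n} & \text{if } \dfrac{n}{\sum x_i} \le \theta_0 \\[2ex] \theta_0^n \exp\left(-\theta_0 \sum x_i\right) & \text{if } \dfrac{n}{\sum x_i} > \theta_0. \end{cases}$$

Hence

$$\lambda = \begin{cases} 1 & \text{if } \dfrac{n}{\sum x_i} \le \theta_0 \\[2ex] \dfrac{\theta_0^n \exp\left(-\theta_0 \sum x_i\right)}{\left(n/\sum x_i\right)^n e^{-n}} & \text{if } \dfrac{n}{\sum x_i} > \theta_0. \end{cases} \tag{13}$$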


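If $0 < \lambda_0 < 1$, then a generalized likelihood-ratio test is given by the following: Reject $\mathscr{H}_0$ if $\lambda \le \lambda_0$, or

$$\text{reject } \mathscr{H}_0 \text{ if } \frac{n}{\sum x_i} > \theta_0 \text{ and } \left(\frac{\theta_0 \sum x_i}{n}\right)^n \exp\left(-\theta_0 \sum x_i + n\right) \le \lambda_0, \tag{14}$$

or reject $\mathscr{H}_0$ if $\theta_0 \bar{x} < 1$ and $(\theta_0 \bar{x})^n e^{-n(\theta_0 \bar{x} - 1)} \le \lambda_0$. Write $y = \theta_0 \bar{x}$ and note that $y^n e^{-n(y - 1)}$ has a maximum for $y = 1$. Hence $y < 1$ and $y^n e^{-n(y - 1)} \le \lambda_0$ if and only if $y \le k$, where $k$ is a constant satisfying $0 < k < 1$. See Fig. 4.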

We see that a generalized likelihood-ratio test reduces to the following:

$$\text{Reject } \mathscr{H}_0 \text{ if and only if } \theta_0 \bar{x} < k, \text{ where } 0 < k < 1, \tag{15}$$

FIGURE 4

that is, reject $\mathscr{H}_0$ if $\bar{x}$ is less than some fraction of $1/\theta_0$. If that generalized likelihood-ratio test having size $\alpha$ is desired, $k$ is obtained as the solution to the equation

$$\alpha = P_{\theta_0}[\theta_0 \overline{X} < k] = P_{\theta_0}\left[\theta_0 \sum X_i < nk\right] = \int_0^{nk} \frac{1}{\Gamma(n)} u^{n-1} e^{-u}\, du.$$

(Note that $P_{\theta}[\theta_0 \overline{X} < k] \le P_{\theta_0}[\theta_0 \overline{X} < k]$ for $\theta \le \theta_0$.) ////
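The equation for $k$ has no closed-form solution in general, but it is easy to solve numerically. The sketch below is one way to do it under stated assumptions: it uses the closed form of the gamma distribution function for integer shape $n$ and a plain bisection search; the names `gamma_cdf_int` and `glr_cutoff` are ours.

```python
import math

def gamma_cdf_int(x, n):
    """P[G <= x] for a gamma variable G with integer shape n and unit scale."""
    if x <= 0:
        return 0.0
    term, total = 1.0, 1.0          # running value of x^j / j! and its partial sum
    for j in range(1, n):
        term *= x / j
        total += term
    return 1.0 - math.exp(-x) * total

def glr_cutoff(n, alpha, tol=1e-12):
    """Solve alpha = P[Gamma(n, 1) < n*k] for k in (0, 1) by bisection."""
    if not 0 < alpha < gamma_cdf_int(n, n):
        raise ValueError("alpha is too large for a cutoff k < 1 to exist")
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gamma_cdf_int(n * mid, n) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

k = glr_cutoff(n=10, alpha=0.05)
print(k)   # reject H0 when theta0 * xbar < k
```

For $n = 1$ the equation reduces to $\alpha = 1 - e^{-k}$, so $k = -\log_e(1 - \alpha)$, which gives a convenient check on the search.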

We note that in the above example the first form of the test, as given in 
Eq. (14), is rather messy, yet after some manipulation the test reduces to a very 
simple form as given in Eq. (15). Such a pattern often appears in dealing with 
generalized likelihood-ratio tests — their first form is often foreboding, yet the 
tests often simplify into some nice form. We will observe this again in Sec. 4
below when we consider tests concerning sampling from the normal distribution.

We might note, by considering the factorization criterion, that a generalized 
likelihood-ratio test must necessarily depend only on minimal sufficient statistics. 

In Sec. 5 below, a large-sample distribution of the generalized likelihood- 
ratio is given. This will provide us with a method of obtaining tests with 
approximate size a. 

3.2 Uniformly Most Powerful Tests 

In Subsec. 3.1 above we exhibited a method of obtaining a family of tests of $\mathscr{H}_0: \theta \in \overline{\Theta}_0$ versus $\mathscr{H}_1: \theta \in \overline{\Theta} - \overline{\Theta}_0$. We now define one optimum property that such a test may possess. It is defined in terms of the power function $\pi_{\Upsilon}(\theta)$ and the size of the test.

Definition 15 Uniformly most powerful test A test $\Upsilon^*$ of $\mathscr{H}_0: \theta \in \overline{\Theta}_0$ versus $\mathscr{H}_1: \theta \in \overline{\Theta} - \overline{\Theta}_0$ is defined to be a uniformly most powerful size-$\alpha$ test if and only if:

(i) $\sup_{\theta \in \overline{\Theta}_0} \pi_{\Upsilon^*}(\theta) = \alpha$.

(ii) $\pi_{\Upsilon^*}(\theta) \ge \pi_{\Upsilon}(\theta)$ for all $\theta \in \overline{\Theta} - \overline{\Theta}_0$ and for any test $\Upsilon$ with size less than or equal to $\alpha$. ////

A test $\Upsilon^*$ is uniformly most powerful of size $\alpha$ if it has size $\alpha$ and if among all tests of size less than or equal to $\alpha$ it has the largest power for all alternative values of $\theta$. The adverb "uniformly" refers to "all" alternative $\theta$ values. A uniformly most powerful test does not exist for all testing problems, but when one does exist, we can see that it is quite a nice test since among all tests of size $\alpha$ or less it has the greatest chance of rejecting $\mathscr{H}_0$ whenever it should.


EXAMPLE 12 Let $X_1, \dots, X_n$ be a random sample from $f(x; \theta) = \theta e^{-\theta x} I_{(0, \infty)}(x)$, where $\overline{\Theta} = \{\theta: \theta \ge \theta_0\}$. Find a uniformly most powerful test of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$. For fixed $\theta_1 > \theta_0$, we determined in Example 7 that the most powerful test of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta = \theta_1$ was given by the following: Reject $\mathscr{H}_0$ if $\sum x_i \le k'$, where $k'$ was given as a solution to the equation

$$\alpha = \int_0^{k'} \frac{1}{\Gamma(n)} \theta_0^n x^{n-1} e^{-\theta_0 x}\, dx.$$

Such a test was given by the Neyman-Pearson lemma. Note that the test in no way depends on $\theta_1$ except that $\theta_1 > \theta_0$; hence, we would get the same most powerful test for any $\theta_1 > \theta_0$, and thus the test is actually uniformly most powerful! ////

The above example provides us with an example of a situation where a uniformly most powerful test can be obtained using the Neyman-Pearson lemma. That same technique can be used to find uniformly most powerful tests in more general situations, such as those given in Theorems 5 and 6 below, which are given without proof.

Theorem 5 Let $X_1, \dots, X_n$ be a random sample from the density $f(x; \theta)$, $\theta \in \overline{\Theta}$, where $\overline{\Theta}$ is some interval. Assume that $f(x; \theta) = a(\theta) b(x) \exp[c(\theta) d(x)]$, and set $t(x_1, \dots, x_n) = \sum_{i=1}^{n} d(x_i)$.

(i) If $c(\theta)$ is a monotone, increasing function in $\theta$ and if there exists $k^*$ such that $P_{\theta_0}[t(X_1, \dots, X_n) > k^*] = \alpha$, then the test $\Upsilon^*$ with critical region $C^* = \{(x_1, \dots, x_n): t(x_1, \dots, x_n) > k^*\}$ is a uniformly most powerful size-$\alpha$ test of $\mathscr{H}_0: \theta \le \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$ or of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$.

(ii) If $c(\theta)$ is a monotone, decreasing function in $\theta$ and if there exists $k^*$ such that $P_{\theta_0}[t(X_1, \dots, X_n) < k^*] = \alpha$, then the test $\Upsilon^*$ with critical region $C^* = \{(x_1, \dots, x_n): t(x_1, \dots, x_n) < k^*\}$ is a uniformly most powerful size-$\alpha$ test of $\mathscr{H}_0: \theta \le \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$ or of $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$. ////

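EXAMPLE 13 Let $X_1, \dots, X_n$ be a random sample from $f(x; \theta) = \theta e^{-\theta x} I_{(0, \infty)}(x)$, where $\overline{\Theta} = \{\theta: \theta > 0\}$. Test $\mathscr{H}_0: \theta \le \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$. $f(x; \theta) = \theta I_{(0, \infty)}(x) \exp(-\theta x) = a(\theta) b(x) \exp[c(\theta) d(x)]$; so $t(x_1, \dots, x_n) = \sum_{i=1}^{n} x_i$ and $c(\theta) = -\theta$. $c(\theta)$ is a monotone, decreasing function; so by (ii) of Theorem 5 a uniformly most powerful test is given by the following: Reject $\mathscr{H}_0$ if and only if $\sum x_i < k^*$, where $k^*$ is given as a solution to

$$\alpha = P_{\theta_0}\left[\sum X_i < k^*\right] = \int_0^{k^*} \frac{1}{\Gamma(n)} \theta_0^n u^{n-1} e^{-\theta_0 u}\, du. \qquad ////$$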

Definition 16 Monotone likelihood-ratio A family of densities $\{f(x; \theta): \theta \in \overline{\Theta}, \overline{\Theta}$ an interval$\}$ is said to have a monotone likelihood-ratio if there exists a statistic, say $T = t(X_1, \dots, X_n)$, such that the ratio $L(\theta'; x_1, \dots, x_n)/L(\theta''; x_1, \dots, x_n)$ is either a nonincreasing function of $t(x_1, \dots, x_n)$ for every $\theta' < \theta''$ or a nondecreasing function of $t(x_1, \dots, x_n)$ for every $\theta' < \theta''$. ////

Note that in the term “monotone likelihood-ratio” the likelihood-ratio is not 
a generalized likelihood-ratio; it is a ratio of two likelihood functions. 


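EXAMPLE 14 If $\{f(x; \theta): \theta \in \overline{\Theta}\} = \{\theta e^{-\theta x} I_{(0, \infty)}(x): \theta > 0\}$, then

$$\frac{L(\theta'; x_1, \dots, x_n)}{L(\theta''; x_1, \dots, x_n)} = \frac{(\theta')^n \exp(-\theta' \sum x_i)}{(\theta'')^n \exp(-\theta'' \sum x_i)} = \left(\frac{\theta'}{\theta''}\right)^n \exp\left[-(\theta' - \theta'') \sum x_i\right],$$

which, for $\theta' < \theta''$, is a monotone, increasing function in $\sum x_i$. ////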


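EXAMPLE 15 If $\{f(x; \theta): \theta \in \overline{\Theta}\} = \{(1/\theta) I_{(0, \theta)}(x): \theta > 0\}$, then

$$\frac{L(\theta'; x_1, \dots, x_n)}{L(\theta''; x_1, \dots, x_n)} = \frac{(1/\theta')^n \prod_{i=1}^{n} I_{(0, \theta')}(x_i)}{(1/\theta'')^n \prod_{i=1}^{n} I_{(0, \theta'')}(x_i)} = \frac{(1/\theta')^n I_{(0, \theta')}(y_n)}{(1/\theta'')^n I_{(0, \theta'')}(y_n)} = \begin{cases} (\theta''/\theta')^n & \text{for } 0 < y_n < \theta' \\ 0 & \text{for } \theta' \le y_n < \theta'', \end{cases}$$

which is a monotone, nonincreasing function in $y_n = \max[x_1, \dots, x_n]$. [Note that $y_n$ cannot fall outside of the interval $(0, \theta'')$ when $\theta$ is either $\theta'$ or $\theta''$.] ////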

Theorem 6 Let $X_1, \dots, X_n$ be a random sample from $f(x; \theta)$, $\theta \in \overline{\Theta}$, where $\overline{\Theta}$ is some interval. Assume that the family of densities $\{f(x; \theta): \theta \in \overline{\Theta}\}$ has a monotone likelihood-ratio in the statistic $t(X_1, \dots, X_n)$.

(i) If the monotone likelihood-ratio is nondecreasing in $t(x_1, \dots, x_n)$ and if $k^*$ is such that $P_{\theta_0}[t(X_1, \dots, X_n) < k^*] = \alpha$, then the test corresponding to the critical region $C^* = \{(x_1, \dots, x_n): t(x_1, \dots, x_n) < k^*\}$ is a uniformly most powerful test of size $\alpha$ of $\mathscr{H}_0: \theta \le \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$.

(ii) If the monotone likelihood-ratio is nonincreasing in $t(x_1, \dots, x_n)$ and if $k^*$ is such that $P_{\theta_0}[t(X_1, \dots, X_n) > k^*] = \alpha$, then the test corresponding to the critical region $C^* = \{(x_1, \dots, x_n): t(x_1, \dots, x_n) > k^*\}$ is a uniformly most powerful test of size $\alpha$ of $\mathscr{H}_0: \theta \le \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$. ////


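EXAMPLE 16 Let $X_1, \dots, X_n$ be a random sample from $f(x; \theta) = (1/\theta) I_{(0, \theta)}(x)$, where $\theta > 0$. Test $\mathscr{H}_0: \theta \le \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$. We saw in Example 15 that the family of densities has a monotone nonincreasing likelihood-ratio in $t(x_1, \dots, x_n) = y_n = \max[x_1, \dots, x_n]$. According to (ii) of Theorem 6, a uniformly most powerful size-$\alpha$ test is given by the following: Reject $\mathscr{H}_0$ if $y_n > k^*$, where $k^*$ is given as the solution to

$$\alpha = P_{\theta_0}[Y_n > k^*] = \int_{k^*}^{\theta_0} n \left(\frac{1}{\theta_0}\right)^n y^{n-1}\, dy = 1 - \left(\frac{k^*}{\theta_0}\right)^n,$$

which implies that $k^* = \theta_0 \sqrt[n]{1 - \alpha}$. ////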
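The cutoff of Example 16 and the power function of the resulting test can be written down directly. The following is a minimal sketch; the function names and the numerical values $\theta_0 = 2$, $\alpha = .05$, $n = 5$ are ours, chosen only for illustration.

```python
def uniform_ump_cutoff(theta0, alpha, n):
    """k* = theta0 * (1 - alpha)**(1/n) from Example 16."""
    return theta0 * (1 - alpha) ** (1 / n)

def power(theta, k_star, n):
    """P_theta[Y_n > k*], where Y_n is the largest of n uniform(0, theta) observations."""
    if theta <= k_star:
        return 0.0                       # Y_n cannot exceed theta
    return 1 - (k_star / theta) ** n

k_star = uniform_ump_cutoff(theta0=2.0, alpha=0.05, n=5)
for theta in (1.0, 2.0, 2.5, 3.0):
    print(theta, power(theta, k_star, n=5))
```

The power is zero for $\theta \le k^*$, equals $\alpha$ at $\theta = \theta_0$, and increases toward 1 as $\theta$ grows, so the supremum of the power over $\theta \le \theta_0$ is attained at $\theta_0$.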


Several comments are in order. First, the null hypothesis was stated as $\theta \le \theta_0$ in both Theorems 5 and 6; if it had been stated as $\theta \ge \theta_0$, the two theorems would remain valid provided the inequalities that define the critical regions were reversed. Second, Theorem 5 is a consequence of Theorem 6. Third, the theorems consider only one-sided hypotheses.

This completes our brief study of uniformly most powerful tests. We have seen that a uniformly most powerful test exists for one-sided hypotheses if the density sampled from has a monotone likelihood-ratio in some statistic. There are many hypothesis-testing problems for which no uniformly most powerful test exists. One method of restricting the class of tests, with the hope and intention of finding an optimum test within the restricted class, is to consider unbiasedness of tests, to be defined in the next subsection.

3.3 Unbiased Tests 

There are many hypotheses-testing problems for which a uniformly most powerful 
test does not exist. In these cases it may be possible to restrict the class of tests 
and find a uniformly most powerful test in the restricted class. One such class 
that has some merit is the class of unbiased tests. 

Definition 17 Unbiased tests A test $\Upsilon$ of the null hypothesis $\mathscr{H}_0: \theta \in \overline{\Theta}_0$ against the alternative hypothesis $\mathscr{H}_1: \theta \in \overline{\Theta}_1$ is an unbiased test if and only if

$$\sup_{\theta \in \overline{\Theta}_0} \pi_{\Upsilon}(\theta) \le \inf_{\theta \in \overline{\Theta}_1} \pi_{\Upsilon}(\theta). \qquad ////$$

Consequently in an unbiased test the probability of rejecting $\mathscr{H}_0$ when it is false is at least as large as the probability of rejecting $\mathscr{H}_0$ when it is true. In many respects this seems to be a reasonable restriction to place on a test. If within this restricted class a test exists that is uniformly most powerful, then we have a uniformly most powerful unbiased test. An elaborate theory has been developed for finding uniformly most powerful unbiased tests, but we will not study it. See [16].

3.4 Methods of Finding Tests 

We presented in Subsec. 3.1 the generalized likelihood-ratio principle; it provides 
us with one method of obtaining tests of hypotheses. In Subsec. 3.2 we gave a 
method of finding a uniformly most powerful test for certain testing problems. 
There are still other methods of finding tests. One method is sketched on pages 
456 to 459 of Subsec. 5.4 of this chapter. Another method, which might be 
called the confidence-interval method , is to use a confidence interval to obtain a 
test. For instance, if it is desired to test $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta \ne \theta_0$, then we might compute a confidence-interval estimate of $\theta$ from the data, and if the interval contains $\theta_0$, accept $\mathscr{H}_0$, and otherwise, reject it. If the confidence interval had a confidence coefficient $\gamma$, then the resulting test would have size $1 - \gamma$. We will say more about how confidence intervals might be used to obtain tests in Sec. 6 below.

A useful and intuitive technique for obtaining tests is the following: Discover some statistic which behaves differently under the two hypotheses, and utilize the different behavior to design a test. As an illustration, consider testing $\mathscr{H}_0: \theta \le \theta_0$ versus $\mathscr{H}_1: \theta > \theta_0$, where the sample $X_1, \dots, X_n$ is selected from the density $f(x; \theta) = \phi_{\theta, 1}(x)$. The statistic $\overline{X}$ has a normal distribution with mean $\theta$ and variance $1/n$; hence the statistic $\overline{X}$ will tend to be smaller when $\mathscr{H}_0$ is true than when $\mathscr{H}_0$ is false. The statistic $\overline{X}$ behaves differently under the two hypotheses. A reasonable test then would be to reject $\mathscr{H}_0$ for $\overline{X}$ large; that is, reject $\mathscr{H}_0$ if $\overline{X} > k$, where $k$ is determined by, say, fixing the size of the test. (We know from Subsec. 3.2 above that such a test is uniformly most powerful.) To employ this technique, a statistic has to be discovered which will behave differently under the two hypotheses. There are various ways of approaching the task of discovering such a statistic. For instance, if a sufficient statistic exists, then it is a natural candidate to try; or a good estimator, such as a maximum-likelihood estimator, of the parameter or parameters that are used to specify the hypotheses is another possibility for the needed statistic. In the above simple illustration $\overline{X}$ was all these, since $\overline{X}$ is the maximum-likelihood estimator of $\theta$ as well as being a sufficient statistic. We make frequent use of this intuitive technique for obtaining tests in the remaining sections.

EXAMPLE 17 Let $X_1, \dots, X_n$ be a random sample from a Poisson distribution with mean $\theta$. Suppose that it is desired to test that the mean is a fixed value, say $\theta_0$; that is, test $\mathscr{H}_0: \theta = \theta_0$ versus $\mathscr{H}_1: \theta \ne \theta_0$. We know that $\overline{X}$ is the maximum-likelihood estimator of $\theta$ and that $\overline{X}$ will tend to be distributed about $\theta_0$ if $\mathscr{H}_0$ is true. Consequently, the following test seems reasonable: Accept $\mathscr{H}_0$ if $c_1 < \bar{x} < c_2$, and otherwise reject it, where $c_1$ and $c_2$ are selected so that the test will have a desired size. To be specific, let $n = 10$ and $\theta_0 = 1$. The test given by "Accept $\mathscr{H}_0$ if and only if $.4 < \bar{x} < 1.6$" has size given by

$$1 - P_{\theta = 1}[.4 < \overline{X} < 1.6] = 1 - P_{\theta = 1}\left[4 < \sum X_i < 16\right] = 1 - \sum_{k=5}^{15} \frac{e^{-10}\, 10^k}{k!} \approx .078. \qquad ////$$

A test that has been quite extensively applied in various fields of science is $\mathscr{H}_0: \theta = \theta_0$ against $\mathscr{H}_1: \theta \ne \theta_0$. For example, let $\theta$ be the mean difference of yields between two varieties of wheat. It is often suggested that it is desirable to test the hypothesis $\mathscr{H}_0: \theta = 0$ against $\mathscr{H}_1: \theta \ne 0$, that is, to test if the two varieties are different in their mean yields. However, in this situation, and many others where $\theta$ can vary continuously in some interval, it is inconceivable that $\theta$ is exactly equal to 0 (that the varieties are identical in their mean yields). Yet this is what the test is stating: Are the two mean yields identical (to one ten-billionth of a bushel, etc.)? In many cases it seems more realistic for an experimenter to select an interval about $\theta_0$, say $\theta_1 < \theta_0 < \theta_2$, and test $\mathscr{H}_0: \theta_1 \le \theta \le \theta_2$ against the alternative $\mathscr{H}_1: \theta < \theta_1$ or $\theta > \theta_2$. For example, it may be
feasible to set 0 l = — jr and 8 2 =\ in the above illustration and test if the 
difference of the mean yields of the two varieties is between bushel and +£ 
bushel against the alternative that it is not in this interval. A test that is uniformly most powerful for the above hypothesis may be difficult or impossible to devise, but if $f(x; \theta)$ is a density with a single parameter, then the maximum-likelihood estimator $\hat{\Theta}$ may sometimes be used to construct a test and the power of this test compared with the ideal power function for a test of size $\alpha$. A test of the following form may be used for some densities: Reject $\mathscr{H}_0$ if $\hat{\theta}$ is not in some interval, say $(c_1, c_2)$, and accept $\mathscr{H}_0$ if $\hat{\theta}$ is in the interval, where $c_1$ and $c_2$ are chosen so that the test has size $\alpha$. Often $c_1$ and $c_2$ can be chosen so that

$$\int_{c_1}^{c_2} f_{\hat{\Theta}}(\hat{\theta}; \theta_1)\, d\hat{\theta} = \int_{c_1}^{c_2} f_{\hat{\Theta}}(\hat{\theta}; \theta_2)\, d\hat{\theta} = 1 - \alpha,$$

where $f_{\hat{\Theta}}(\hat{\theta}; \theta)$ is the density of $\hat{\Theta}$ when $\theta$ is the parameter. The power function of this test is

$$\pi(\theta) = 1 - \int_{c_1}^{c_2} f_{\hat{\Theta}}(\hat{\theta}; \theta)\, d\hat{\theta} \quad \text{for } \theta \text{ in } \overline{\Theta}.$$

This power function can be compared with the ideal power function, and if it 
does not deviate further from the ideal than the experimenter can tolerate, the 
test may be useful even though it may not be a uniformly most powerful test. 
Let us illustrate the above with a simple example. 

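EXAMPLE 18 Let $X_1, \dots, X_n$ be a random sample from $\phi_{\theta, 1}(x)$. Test $\mathscr{H}_0: 1 \le \theta \le 2$ versus $\mathscr{H}_1: \theta < 1$ or $\theta > 2$. $\overline{X}$ is the maximum-likelihood estimator of $\theta$; it has a normal distribution with mean $\theta$ and variance $1/n$. According to the above we would like to select $c_1$ and $c_2$ so that

$$1 - \alpha = \int_{c_1}^{c_2} \phi_{1, 1/n}(x)\, dx = \int_{c_1}^{c_2} \phi_{2, 1/n}(x)\, dx.$$

We have

$$\Phi\left(\sqrt{n}(c_2 - 1)\right) - \Phi\left(\sqrt{n}(c_1 - 1)\right) = \Phi\left(\sqrt{n}(c_2 - 2)\right) - \Phi\left(\sqrt{n}(c_1 - 2)\right),$$

and we can see from Fig. 5 that $c_1 = 1 - d$ and $c_2 = 2 + d$, where $d$ is given by, say,

$$\Phi\left(\sqrt{n}(1 + d)\right) - \Phi\left(-\sqrt{n}\, d\right) = 1 - \alpha.$$

FIGURE 5

FIGURE 6

For example, if $\alpha = .05$ and $n = 16$, then $d \approx .411$, so $c_1 \approx .589$, and $c_2 \approx 2.411$. The power function is given by

$$\pi(\theta) = 1 - P_{\theta}[c_1 < \overline{X} < c_2] = 1 - P_{\theta}[.589 < \overline{X} < 2.411]$$

and is sketched in Fig. 6. ////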
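The constant $d$ of Example 18 and the power function can be computed with the standard normal distribution function. The sketch below assumes the symmetric choice $c_1 = 1 - d$, $c_2 = 2 + d$ used in the example and solves for $d$ by bisection; the function names are ours.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf     # standard normal distribution function

def acceptance_prob(theta, c1, c2, n):
    """P_theta[c1 < Xbar < c2] when Xbar is normal with mean theta, variance 1/n."""
    root_n = math.sqrt(n)
    return PHI(root_n * (c2 - theta)) - PHI(root_n * (c1 - theta))

def solve_d(alpha, n, tol=1e-10):
    """Find d with P_1[1 - d < Xbar < 2 + d] = 1 - alpha by bisection."""
    lo, hi = 0.0, 10.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if acceptance_prob(1.0, 1.0 - mid, 2.0 + mid, n) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def power(theta, d, n):
    """Power function of the test that rejects unless 1 - d < xbar < 2 + d."""
    return 1 - acceptance_prob(theta, 1.0 - d, 2.0 + d, n)

d = solve_d(alpha=0.05, n=16)
print(round(d, 3), round(1 - d, 3), round(2 + d, 3))
for theta in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
    print(theta, round(power(theta, d, 16), 4))
```

The printed power values trace the curve of Fig. 6: the power equals $\alpha$ at the two endpoints $\theta = 1$ and $\theta = 2$, dips well below $\alpha$ between them, and rises toward 1 outside the interval.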

4 TESTS OF HYPOTHESES— SAMPLING 
FROM THE NORMAL DISTRIBUTION 

A number of the foregoing ideas are well illustrated by common practical testing problems, namely those of testing hypotheses concerning the parameters of normal distributions. The section is subdivided into four subsections, the first two dealing with just one normal population and the last two dealing with several normal populations.

4.1 Tests on the Mean

We shall assume that we have a random sample of $n$ observations $X_1, \dots, X_n$ from a normal population with mean $\mu$ and variance $\sigma^2$, and we will be interested in testing hypotheses about $\mu$. There is quite a variety of hypotheses about the mean $\mu$ that can be formulated; we begin by considering one-sided hypotheses.

$\mathscr{H}_0: \mu \le \mu_0$ versus $\mathscr{H}_1: \mu > \mu_0$ In testing $\mathscr{H}_0: \mu \le \mu_0$ versus $\mathscr{H}_1: \mu > \mu_0$ there are two cases to consider depending on whether or not $\sigma^2$ is assumed known. If $\sigma^2$ is assumed known, our parameter space is the real line, and we are testing a one-sided hypothesis; so we have hope of finding a uniformly most powerful test. Since $\sigma$ is assumed known, it is a known constant; hence

$$f(x; \theta) = \phi_{\mu, \sigma^2}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}[(x - \mu)/\sigma]^2} = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}(\mu/\sigma)^2} e^{-\frac{1}{2}(x/\sigma)^2} e^{\mu x/\sigma^2},$$

which is a member of the exponential family with

$$a(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}(\mu/\sigma)^2}, \quad b(x) = e^{-\frac{1}{2}(x/\sigma)^2}, \quad c(\mu) = \frac{\mu}{\sigma^2}, \quad \text{and} \quad d(x) = x.$$

The conditions for Theorem 5 are satisfied; so the uniformly most powerful size-$\alpha$ test is given by the following: Reject $\mathscr{H}_0$ if $t(x_1, \dots, x_n) = \sum_{i=1}^{n} x_i > k^*$, where $k^*$ is given as a solution to $\alpha = P_{\mu_0}[\sum X_i > k^*]$. Now $\alpha = P_{\mu_0}[\sum X_i > k^*] = 1 - \Phi((k^* - n\mu_0)/\sqrt{n}\,\sigma)$; so $(k^* - n\mu_0)/\sqrt{n}\,\sigma = z_{1-\alpha}$, where $z_{1-\alpha}$ is the $(1 - \alpha)$th quantile of the standard normal distribution. The test becomes the following: Reject $\mathscr{H}_0$ if $\sum x_i > n\mu_0 + \sqrt{n}\,\sigma z_{1-\alpha}$, or reject $\mathscr{H}_0$ if $\bar{x} > \mu_0 + (\sigma/\sqrt{n}) z_{1-\alpha}$.
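The one-sided test with $\sigma$ known is simple to carry out. The following is a minimal sketch using the standard normal quantile from Python's `statistics` module; the function names and the data are ours, chosen only for illustration.

```python
import math
from statistics import NormalDist, fmean

def one_sided_z_test(xs, mu0, sigma, alpha):
    """Reject H0: mu <= mu0 when xbar > mu0 + (sigma / sqrt(n)) * z_{1-alpha}.

    Returns the decision (True means reject) and the cutoff for xbar.
    """
    n = len(xs)
    cutoff = mu0 + sigma / math.sqrt(n) * NormalDist().inv_cdf(1 - alpha)
    return fmean(xs) > cutoff, cutoff

def power(mu, mu0, sigma, n, alpha):
    """P_mu[Xbar > cutoff], where Xbar is normal with mean mu, variance sigma^2/n."""
    cutoff = mu0 + sigma / math.sqrt(n) * NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist(mu, sigma / math.sqrt(n)).cdf(cutoff)

# Hypothetical data with sigma = 1 assumed known.
reject, cutoff = one_sided_z_test([6.1, 5.9, 6.4, 6.0], mu0=5.0, sigma=1.0, alpha=0.05)
print(reject, round(cutoff, 4))
print(power(5.0, 5.0, 1.0, 4, 0.05))   # equals alpha at mu = mu0
```

The power function is increasing in $\mu$, so its supremum over $\mu \le \mu_0$ is attained at $\mu_0$, which is why the cutoff is computed under $\mu = \mu_0$.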

If $\sigma^2$ is assumed unknown, then testing $\mathscr{H}_0: \mu \le \mu_0$ versus $\mathscr{H}_1: \mu > \mu_0$ is equivalent to testing $\mathscr{H}_0: \theta \in \overline{\Theta}_0$ versus $\mathscr{H}_1: \theta \in \overline{\Theta} - \overline{\Theta}_0$, where $\theta = (\mu, \sigma^2)$, $\overline{\Theta} = \{(\mu, \sigma^2): -\infty < \mu < \infty;\ \sigma^2 > 0\}$, and $\overline{\Theta}_0 = \{(\mu, \sigma^2): \mu \le \mu_0;\ \sigma^2 > 0\}$. To obtain a test, we could use the generalized likelihood-ratio principle, or we could find some statistic that behaves differently under the two hypotheses and base our test on it. Such a statistic is $T = (\overline{X} - \mu_0)/(S/\sqrt{n})$, where $\overline{X}$ is the sample mean and $S^2$ is the sample variance. Since $T$ would tend to be larger for $\mu > \mu_0$ than for $\mu \le \mu_0$, a test based on $T$ is given by the following: Reject $\mathscr{H}_0$ if $T$ is large; that is, reject $\mathscr{H}_0$ if $T > k$. If $\mu = \mu_0$, then $T$ has a $t$ distribution with $n - 1$ degrees of freedom; so $k$ can be determined by setting $\alpha = P_{\mu = \mu_0}[T > k]$, which implies that $k = t_{1-\alpha}(n - 1)$, the $(1 - \alpha)$th quantile of a $t$ distribution with $n - 1$ degrees of freedom. It can be shown that the test derived here is a generalized likelihood-ratio test having size $\alpha$.

$\mathscr{H}_0: \mu = \mu_0$ versus $\mathscr{H}_1: \mu \ne \mu_0$ Again, we have two cases to consider depending on whether or not $\sigma^2$ is assumed known. For $\sigma^2$ known, we know that $(\overline{X} - z_{(1+\gamma)/2}(\sigma/\sqrt{n}),\ \overline{X} + z_{(1+\gamma)/2}(\sigma/\sqrt{n}))$ is a $100\gamma$ percent confidence interval for $\mu$, where $z_{(1+\gamma)/2}$ is the $[(1 + \gamma)/2]$th quantile of the standard normal distribution. A possible test is given by the following: Reject $\mathscr{H}_0$ if the confidence interval does not contain $\mu_0$. Such a test has size $1 - \gamma$ since

$$P_{\mu = \mu_0}\left[\overline{X} - z_{(1+\gamma)/2}\frac{\sigma}{\sqrt{n}} < \mu_0 < \overline{X} + z_{(1+\gamma)/2}\frac{\sigma}{\sqrt{n}}\right] = \gamma.$$

If $\sigma^2$ is assumed unknown, we could obtain a test, similar to the one above, using the $100\gamma$ percent confidence interval

$$\left(\overline{X} - t_{(1+\gamma)/2}(n - 1)\frac{S}{\sqrt{n}},\ \overline{X} + t_{(1+\gamma)/2}(n - 1)\frac{S}{\sqrt{n}}\right).$$

Instead, let us find a generalized likelihood-ratio test.


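$\overline{\Theta}_0 = \{(\mu, \sigma^2): \mu = \mu_0;\ \sigma^2 > 0\}$ and $\overline{\Theta} = \{(\mu, \sigma^2): -\infty < \mu < \infty;\ \sigma^2 > 0\}$. We have already seen that the values of $\mu$ and $\sigma^2$ which maximize $L(\mu, \sigma^2; x_1, \dots, x_n)$ in $\overline{\Theta}$ are $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = (1/n)\sum (x_i - \bar{x})^2$; so

$$\sup_{\overline{\Theta}} L(\mu, \sigma^2; x_1, \dots, x_n) = \left[\frac{n}{2\pi \sum (x_i - \bar{x})^2}\right]^{n/2} e^{-n/2}.$$

To maximize $L$ over $\overline{\Theta}_0$, we put $\mu = \mu_0$, and the only remaining parameter is $\sigma^2$; the value of $\sigma^2$ which then maximizes $L$ is readily found to be $\hat{\sigma}^2 = (1/n)\sum (x_i - \mu_0)^2$, which gives

$$\sup_{\overline{\Theta}_0} L(\mu, \sigma^2; x_1, \dots, x_n) = \left[\frac{n}{2\pi \sum (x_i - \mu_0)^2}\right]^{n/2} e^{-n/2}.$$

The generalized likelihood-ratio is then

$$\lambda = \left[\frac{\sum (x_i - \bar{x})^2}{\sum (x_i - \mu_0)^2}\right]^{n/2} = \left[\frac{\sum (x_i - \bar{x})^2}{\sum (x_i - \bar{x} + \bar{x} - \mu_0)^2}\right]^{n/2}$$
$$= \left[\frac{\sum (x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2 + n(\bar{x} - \mu_0)^2}\right]^{n/2} = \left[\frac{1}{1 + n(\bar{x} - \mu_0)^2/\sum (x_i - \bar{x})^2}\right]^{n/2}.$$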


We note now that $\lambda$ is a monotonic function of

$$t^2 = t^2(x_1, \dots, x_n) = \frac{n(n - 1)(\bar{x} - \mu_0)^2}{\sum (x_i - \bar{x})^2}, \tag{16}$$

and so a critical region of the form $\lambda \le \lambda_0$ is equivalent to a critical region of the form $t^2(x_1, \dots, x_n) \ge k^2$. A generalized likelihood-ratio test is then given by the following: Reject $\mathscr{H}_0$ if and only if

$$T^2 = \left[\frac{\overline{X} - \mu_0}{S/\sqrt{n}}\right]^2 > k^2,$$

or accept $\mathscr{H}_0$ if and only if $-k < T < k$. Since $T$ has a $t$ distribution with $n - 1$ degrees of freedom when $\mu = \mu_0$, if $k$ is selected so that $\int_{-k}^{k} f_T(t; n - 1)\, dt = 1 - \alpha$, then our test will have size $\alpha$. $k$ is given by $t_{1-\alpha/2}(n - 1)$, the $(1 - \alpha/2)$th quantile of a $t$ distribution with $n - 1$ degrees of freedom. We might note that this size-$\alpha$ test obtained by using the generalized likelihood-ratio principle is the same size-$\alpha$ test that we obtained above using the confidence-interval method of obtaining tests with the confidence interval

$$\left(\overline{X} - t_{(1+\gamma)/2}(n - 1)\frac{S}{\sqrt{n}},\ \overline{X} + t_{(1+\gamma)/2}(n - 1)\frac{S}{\sqrt{n}}\right),$$

where $\gamma = 1 - \alpha$. Although we will not prove it, the test that we have obtained is uniformly most powerful unbiased.
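The claim that $\lambda$ is a monotonic function of $t^2$ can be made concrete: combining the last form of $\lambda$ with Eq. (16) gives $\lambda = [1 + t^2/(n - 1)]^{-n/2}$. The sketch below checks this identity numerically on made-up data; the function names and the sample are ours.

```python
import math
from statistics import fmean

def glr_lambda(xs, mu0):
    """Generalized likelihood-ratio for H0: mu = mu0 with sigma^2 unknown."""
    xbar = fmean(xs)
    ss_about_mean = sum((x - xbar) ** 2 for x in xs)
    ss_about_mu0 = sum((x - mu0) ** 2 for x in xs)
    return (ss_about_mean / ss_about_mu0) ** (len(xs) / 2)

def t_statistic(xs, mu0):
    """T = (xbar - mu0) / (s / sqrt(n)), with s^2 the sample variance."""
    n, xbar = len(xs), fmean(xs)
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    return (xbar - mu0) / math.sqrt(s2 / n)

xs, mu0 = [4.1, 5.3, 6.0, 5.5, 4.8], 4.5      # hypothetical observations
n, t = len(xs), t_statistic(xs, mu0)
print(glr_lambda(xs, mu0), (1 + t * t / (n - 1)) ** (-n / 2))   # the two values agree
```

Since $[1 + t^2/(n - 1)]^{-n/2}$ is strictly decreasing in $t^2$, small values of $\lambda$ correspond exactly to large values of $t^2$, which is the equivalence of critical regions used above.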

We have found tests on the mean of a normal distribution for both one-sided and two-sided hypotheses. One might note that the one-sided null hypothesis $\mu \le \mu_0$ could be reversed and comparable results obtained. There are other hypotheses about the mean that could be formulated, such as $\mathscr{H}_0: \mu_1 \le \mu \le \mu_2$ versus $\mathscr{H}_1: \mu < \mu_1$ or $\mu > \mu_2$.


4.2 Tests on the Variance 

As in the last subsection, we shall assume that we have a random sample of size $n$ from a normal population with mean $\mu$ and variance $\sigma^2$. We will be interested in testing hypotheses about $\sigma^2$.

$\mathscr{H}_0: \sigma^2 \le \sigma_0^2$ versus $\mathscr{H}_1: \sigma^2 > \sigma_0^2$ There are two cases to consider depending on whether or not $\mu$ is assumed known. If $\mu$ is known, then our parameter space is an interval, and our hypothesis is one-sided; so we have a chance of finding a uniformly most powerful size-$\alpha$ test.

$$f(x; \theta) = \phi_{\mu, \sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right],$$

which is a member of the exponential family with $a(\sigma^2) = (2\pi\sigma^2)^{-1/2}$, $b(x) = 1$, $c(\sigma^2) = -1/2\sigma^2$, and $d(x) = (x - \mu)^2$. [$\mu$ is known; so $d(x)$ is a function of $x$ only.] $c(\sigma^2)$ is a monotone, increasing function in $\sigma^2$; so, by Theorem 5, the test with critical region $C^* = \{(x_1, \dots, x_n): \sum (x_i - \mu)^2 > k^*\}$ is uniformly most powerful of size $\alpha$, where $k^*$ is given by $P_{\sigma^2 = \sigma_0^2}[\sum (X_i - \mu)^2 > k^*] = \alpha$, which implies that $k^* = \sigma_0^2 \chi^2_{1-\alpha}(n)$, where $\chi^2_{1-\alpha}(n)$ is the $(1 - \alpha)$th quantile point of the chi-square distribution with $n$ degrees of freedom.

If $\mu$ is unknown, a test can be found using the statistic $V = \sum (X_i - \overline{X})^2/\sigma_0^2$. $V$ will tend to be larger for $\sigma^2 > \sigma_0^2$ than for $\sigma^2 \le \sigma_0^2$; so a reasonable test would be to reject $\mathscr{H}_0$ for $V$ large. If $\sigma^2 = \sigma_0^2$, then $V$ has a chi-square distribution with $n - 1$ degrees of freedom, and $P_{\sigma^2 = \sigma_0^2}[V > \chi^2_{1-\alpha}(n - 1)] = \alpha$, where $\chi^2_{1-\alpha}(n - 1)$ is the $(1 - \alpha)$th quantile of a chi-square distribution with $n - 1$ degrees of freedom. It can be shown that the test given by the following: Reject $\mathscr{H}_0$ if and only if $\sum (X_i - \overline{X})^2/\sigma_0^2 > \chi^2_{1-\alpha}(n - 1)$ is a generalized likelihood-ratio test of size $\alpha$.

$\mathscr{H}_0: \sigma^2 = \sigma_0^2$ versus $\mathscr{H}_1: \sigma^2 \ne \sigma_0^2$ We leave the case $\mu$ assumed known as an exercise. For $\mu$ unknown, so that $\overline{\Theta}_0 = \{(\mu, \sigma^2): -\infty < \mu < \infty;\ \sigma^2 = \sigma_0^2\}$, we can find a size-$\alpha$ test using the confidence-interval method. In Subsec. 3.2 of Chap. VIII, we found the following $100\gamma$ percent confidence interval for $\sigma^2$:

$$\left(\frac{(n - 1)S^2}{q_2},\ \frac{(n - 1)S^2}{q_1}\right),$$

where $q_1$ and $q_2$ are quantile points of a chi-square distribution with $n - 1$ degrees of freedom, say $f_Q(q; n - 1)$, satisfying

$$\int_{q_1}^{q_2} f_Q(q; n - 1)\, dq = \gamma.$$

A size-$(\alpha = 1 - \gamma)$ test is given by the following: Accept $\mathscr{H}_0$ if and only if $\sigma_0^2$ is contained in the above confidence interval. It is left as an exercise to show that for a particular pair of $q_1$ and $q_2$ the test of size $\alpha$ derived by the confidence-interval method is in fact the generalized likelihood-ratio test of size $\alpha$.


4.3 Tests on Several Means 

In this subsection we will consider testing hypotheses regarding the means of two or more normal populations. We begin with a test of the equality of two means.

Equality of two means In many situations it is necessary to compare two means when neither is known. If, for example, one wished to compare two proposed new processes for manufacturing light bulbs, one would have to base the comparison on estimates of both process means. In comparing the yield of a new line of hybrid corn with that of a standard line, one would also have to use estimates of both mean yields because it is impossible to state the mean yield of the standard line for the given weather conditions under which the new line would be grown. It is necessary to compare the two lines by planting them in the same season and on the same soil type and thereby obtain estimates of the mean yields for both lines under similar conditions. Of course the comparison is thus specialized; a complete comparison of the two lines would require tests over a period of years on a variety of soil types.

The general problem is this: We have two normal populations, one with a random variable $X_1$, which has a mean $\mu_1$ and variance $\sigma_1^2$, and the other with a random variable $X_2$, which has a mean $\mu_2$ and variance $\sigma_2^2$. On the basis of two samples, one from each population, we wish to test the null hypothesis

$$\mathscr{H}_0: \mu_1 = \mu_2,\ \sigma_1^2 > 0,\ \sigma_2^2 > 0 \quad \text{versus} \quad \mathscr{H}_1: \mu_1 \ne \mu_2,\ \sigma_1^2 > 0,\ \sigma_2^2 > 0.$$

The parameter space $\overline{\Theta}$ here is four-dimensional; a joint distribution of $X_1$ and $X_2$ is specified when values are assigned to the four quantities $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$. The subspace $\overline{\Theta}_0$ is three-dimensional because values for only three quantities $(\mu, \sigma_1^2, \sigma_2^2)$ need be specified in order to specify completely the joint distribution under the hypothesis that $\mu_1 = \mu_2 = \mu$, say.

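We shall suppose that there are $n_1$ observations $(X_{11}, X_{12}, \dots, X_{1n_1})$ in the sample from the first population and $n_2$ observations $(X_{21}, X_{22}, \dots, X_{2n_2})$ from the second. The likelihood function is

$$L(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2; x_{11}, \dots, x_{1n_1}, x_{21}, \dots, x_{2n_2}) = \left(\frac{1}{2\pi\sigma_1^2}\right)^{n_1/2} \exp\left[-\frac{1}{2}\sum_{i=1}^{n_1}\left(\frac{x_{1i} - \mu_1}{\sigma_1}\right)^2\right] \left(\frac{1}{2\pi\sigma_2^2}\right)^{n_2/2} \exp\left[-\frac{1}{2}\sum_{j=1}^{n_2}\left(\frac{x_{2j} - \mu_2}{\sigma_2}\right)^2\right],$$

and its maximum in $\overline{\Theta}$ is readily seen to be

$$\sup_{\overline{\Theta}} L = \left[\frac{n_1}{2\pi \sum_{i=1}^{n_1} (x_{1i} - \bar{x}_1)^2}\right]^{n_1/2} \left[\frac{n_2}{2\pi \sum_{j=1}^{n_2} (x_{2j} - \bar{x}_2)^2}\right]^{n_2/2} e^{-(n_1 + n_2)/2}.$$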

If we put $\mu_1$ and $\mu_2$ equal to $\mu$, say, and try to maximize $L$ with respect to $\mu$, $\sigma_1^2$, and $\sigma_2^2$, it will be found that the estimate of $\mu$ is given as the root of a cubic equation and will be a very complex function of the observations. The resulting generalized likelihood-ratio $\lambda$ will therefore be a complicated function, and to find its distribution is a tedious task indeed and involves the ratio of the two variances. This makes it impossible to determine a critical region $0 < \lambda < k$ for a given probability of a Type I error because the ratio of the population variances is assumed unknown. A number of special devices can be employed in an attempt to circumvent this difficulty, but we shall not pursue the problem further here. For large samples the following criterion may be used: The root of the cubic equation can be computed in any instance by numerical methods, and $\lambda$ can then be calculated; furthermore, as we shall see in Sec. 5 below, the quantity $-2 \log \Lambda$ has approximately the chi-square distribution with one degree of freedom, and hence a test that would reject for $-2 \log \Lambda$ large could be devised.

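When it can be assumed that the two populations have the same variance, the problem becomes relatively simple. The parameter space $\overline{\Theta}$ is then three-dimensional with coordinates $(\mu_1, \mu_2, \sigma^2)$, while $\overline{\Theta}_0$ for the null hypothesis $\mu_1 = \mu_2 = \mu$ is two-dimensional with coordinates $(\mu, \sigma^2)$. In $\overline{\Theta}$ we find that the maximum-likelihood estimates of $\mu_1$, $\mu_2$, and $\sigma^2$ are, respectively, $\bar{x}_1$, $\bar{x}_2$, and

$$\frac{1}{n_1 + n_2}\left[\sum_{i=1}^{n_1} (x_{1i} - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_{2j} - \bar{x}_2)^2\right];$$

so

$$\sup_{\overline{\Theta}} L = \left[\frac{n_1 + n_2}{2\pi\left[\sum (x_{1i} - \bar{x}_1)^2 + \sum (x_{2j} - \bar{x}_2)^2\right]}\right]^{(n_1 + n_2)/2} e^{-(n_1 + n_2)/2}.$$

In $\overline{\Theta}_0$, the maximum-likelihood estimates of $\mu$ and $\sigma^2$ are

$$\hat{\mu} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2} \quad \text{for } \mu$$

and

$$\frac{1}{n_1 + n_2}\left[\sum (x_{1i} - \bar{x}_1)^2 + \sum (x_{2j} - \bar{x}_2)^2 + \frac{n_1 n_2}{n_1 + n_2}(\bar{x}_1 - \bar{x}_2)^2\right] \quad \text{for } \sigma^2,$$

which gives

$$\sup_{\overline{\Theta}_0} L = \left[\frac{n_1 + n_2}{2\pi\left[\sum (x_{1i} - \bar{x}_1)^2 + \sum (x_{2j} - \bar{x}_2)^2 + \frac{n_1 n_2}{n_1 + n_2}(\bar{x}_1 - \bar{x}_2)^2\right]}\right]^{(n_1 + n_2)/2} e^{-(n_1 + n_2)/2}.$$

Finally,

$$\lambda = \left\{1 + \frac{[n_1 n_2/(n_1 + n_2)](\bar{x}_1 - \bar{x}_2)^2}{\sum (x_{1i} - \bar{x}_1)^2 + \sum (x_{2j} - \bar{x}_2)^2}\right\}^{-(n_1 + n_2)/2}. \tag{17}$$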

This last expression is very similar to the corresponding one obtained in
Subsec. 4.1, and it turns out that this test can also be performed in terms of a
quantity which has the $t$ distribution. We know that $\bar{X}_1$ and $\bar{X}_2$ are independ-
ently normally distributed with means $\mu_1$ and $\mu_2$ and with variances $\sigma^2/n_1$ and
$\sigma^2/n_2$. Also it is readily seen that $\bar{X}_1 - \bar{X}_2$ is normally distributed with mean
$\mu_1 - \mu_2$ and variance $\sigma^2(1/n_1 + 1/n_2)$. Under the null hypothesis the mean of
$\bar{X}_1 - \bar{X}_2$ will be 0. The quantities $\sum (X_{1i} - \bar{X}_1)^2/\sigma^2$ and $\sum (X_{2j} - \bar{X}_2)^2/\sigma^2$ are
independently distributed as chi-square distributions with $n_1 - 1$ and $n_2 - 1$
degrees of freedom, respectively; hence their sum has the chi-square distribution
with $n_1 + n_2 - 2$ degrees of freedom. Since under the null hypothesis

$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sigma\sqrt{1/n_1 + 1/n_2}}$$

is normally distributed with mean 0 and unit variance, the quantity

$$T = \frac{\sqrt{n_1 n_2/(n_1+n_2)}\,(\bar{X}_1 - \bar{X}_2)}{\sqrt{\left[\sum (X_{1i}-\bar{X}_1)^2 + \sum (X_{2j}-\bar{X}_2)^2\right]/(n_1+n_2-2)}} \tag{18}$$

has the $t$ distribution with $n_1 + n_2 - 2$ degrees of freedom. [Note that we do
have independence of the numerator and denominator in Eq. (18).] The gener-
alized likelihood-ratio is

$$\Lambda = \left[\frac{1}{1 + T^2/(n_1+n_2-2)}\right]^{(n_1+n_2)/2}, \tag{19}$$

and its distribution is determined by the $t$ distribution. The test would, of
course, be done in terms of $T$ rather than $\Lambda$. A 5 percent critical region for $T$ is
$T^2 > [t_{.975}(n_1+n_2-2)]^2$, where $t_{.975}(n_1+n_2-2)$ is the .975th quantile of the
$t$ distribution with $n_1 + n_2 - 2$ degrees of freedom.

If we want to test $\mathcal{H}_0$: $\mu_1 = \mu_2$ versus $\mathcal{H}_1$: $\mu_1 > \mu_2$ or $\mathcal{H}_0$: $\mu_1 \le \mu_2$ versus
$\mathcal{H}_1$: $\mu_1 > \mu_2$, a size-$\alpha$ test is given by the following: Reject $\mathcal{H}_0$ if and only if
$T > t_{1-\alpha}(n_1+n_2-2)$, where $T$ is defined in Eq. (18) and $t_{1-\alpha}(n_1+n_2-2)$ is the
$(1-\alpha)$th quantile of the $t$ distribution with $n_1 + n_2 - 2$ degrees of freedom.
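As a numerical check on the algebra connecting Eqs. (17), (18), and (19), the following sketch computes $T$ and the generalized likelihood-ratio from two small samples and confirms that the two expressions for the ratio agree. The data and the function name `two_sample_pooled` are hypothetical and serve only as an illustration; the critical value $t_{1-\alpha}(n_1+n_2-2)$ would still be taken from a table of the $t$ distribution.

```python
import math

def two_sample_pooled(x1, x2):
    """Return (t, lam): the statistic of Eq. (18) and the ratio of Eq. (17).

    Assumes both samples come from normal populations with a common variance.
    """
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    # Pooled within-sample sum of squares (denominator of Eq. (17)).
    ss = sum((x - m1) ** 2 for x in x1) + sum((x - m2) ** 2 for x in x2)
    c = n1 * n2 / (n1 + n2)
    t = math.sqrt(c) * (m1 - m2) / math.sqrt(ss / (n1 + n2 - 2))   # Eq. (18)
    lam = (1 + c * (m1 - m2) ** 2 / ss) ** (-(n1 + n2) / 2)         # Eq. (17)
    return t, lam

# Hypothetical observations, not taken from the text.
x1 = [5.1, 4.9, 6.2, 5.8, 5.5]
x2 = [4.2, 4.8, 5.0, 4.4]
t, lam = two_sample_pooled(x1, x2)
n = len(x1) + len(x2)
lam_from_t = (1 / (1 + t ** 2 / (n - 2))) ** (n / 2)                # Eq. (19)
print(t, lam, lam_from_t)
```

Because the ratio is a decreasing function of $T^2$, rejecting for small values of the ratio is the same as rejecting for large $|T|$, which is why the test is carried out with $T$.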


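Equality of several means The test presented above can be extended from
just two normal populations to $k$ normal populations. We assume that we have
available $k$ random samples, one from each of $k$ normal populations; that is,
let $X_{j1}, \ldots, X_{jn_j}$ be a random sample of size $n_j$ from the $j$th normal population,
$j = 1, \ldots, k$. Assume that the $j$th population has mean $\mu_j$ and variance $\sigma^2$.
Further assume that the $k$ random samples are independent. Our object is to
test the null hypothesis that all the population means are the same versus the
alternative that not all the means are equal. We seek a generalized likelihood-
ratio test. The likelihood function is given by

$$L(\mu_1, \ldots, \mu_k, \sigma^2; x_{11}, \ldots, x_{1n_1}, \ldots, x_{k1}, \ldots, x_{kn_k}) = \prod_{j=1}^{k}\prod_{i=1}^{n_j} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x_{ji}-\mu_j}{\sigma}\right)^2\right] = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[-\frac{1}{2\sigma^2}\sum_{j}\sum_{i}(x_{ji}-\mu_j)^2\right],$$

where $n = \sum_{j=1}^{k} n_j$.

The parameter space $\Theta$ is $(k+1)$-dimensional with coordinates
$(\mu_1, \ldots, \mu_k, \sigma^2)$, and $\Theta_0$, the collection of points in the parameter space corre-
sponding to the null hypothesis, is two-dimensional with coordinates $(\mu, \sigma^2)$, where
$\mu = \mu_1 = \cdots = \mu_k$. In $\Theta$, the maximum-likelihood estimates of $\mu_1, \ldots, \mu_k, \sigma^2$
are given by

$$\hat{\mu}_j = \bar{x}_{j\cdot} = \frac{1}{n_j}\sum_{i=1}^{n_j} x_{ji}, \quad j = 1, \ldots, k, \qquad \text{and} \qquad \hat{\sigma}^2_{\Theta} = \frac{1}{n}\sum_{j}\sum_{i}(x_{ji}-\bar{x}_{j\cdot})^2; \tag{20}$$

therefore,

$$\sup_{\Theta} L = \left[\frac{2\pi}{n}\sum_{j}\sum_{i}(x_{ji}-\bar{x}_{j\cdot})^2\right]^{-n/2} e^{-n/2}.$$

In $\Theta_0$, the maximum-likelihood estimates of $\mu$ and $\sigma^2$ are

$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{j}\sum_{i} x_{ji} \qquad \text{and} \qquad \hat{\sigma}^2_{\Theta_0} = \frac{1}{n}\sum_{j}\sum_{i}(x_{ji}-\bar{x})^2;$$

therefore,

$$\sup_{\Theta_0} L = \left[\frac{2\pi}{n}\sum_{j}\sum_{i}(x_{ji}-\bar{x})^2\right]^{-n/2} e^{-n/2}.$$

The generalized likelihood-ratio is then

$$\lambda = \frac{\sup_{\Theta_0} L}{\sup_{\Theta} L} = \left[\frac{\sum_j\sum_i (x_{ji}-\bar{x})^2}{\sum_j\sum_i (x_{ji}-\bar{x}_{j\cdot})^2}\right]^{-n/2} = \left[\frac{\sum_j\sum_i (x_{ji}-\bar{x}_{j\cdot}+\bar{x}_{j\cdot}-\bar{x})^2}{\sum_j\sum_i (x_{ji}-\bar{x}_{j\cdot})^2}\right]^{-n/2}$$

$$= \left[\frac{\sum_j\sum_i (x_{ji}-\bar{x}_{j\cdot})^2 + \sum_j n_j(\bar{x}_{j\cdot}-\bar{x})^2}{\sum_j\sum_i (x_{ji}-\bar{x}_{j\cdot})^2}\right]^{-n/2} = \left[1 + \frac{k-1}{n-k}\cdot\frac{\sum_j n_j(\bar{x}_{j\cdot}-\bar{x})^2/(k-1)}{\sum_j\sum_i (x_{ji}-\bar{x}_{j\cdot})^2/(n-k)}\right]^{-n/2}.$$

A generalized likelihood-ratio test is given by the following: Reject $\mathcal{H}_0$ if and
only if $\lambda \le \lambda_0$. But $\lambda \le \lambda_0$ if and only if

$$r = \frac{\sum_j n_j(\bar{x}_{j\cdot}-\bar{x})^2/(k-1)}{\sum_j\sum_i (x_{ji}-\bar{x}_{j\cdot})^2/(n-k)} \ge \text{some constant, say } c. \tag{21}$$

The ratio $r$ is sometimes called the variance ratio, or $F$ ratio. The constant $c$ is
determined so that the test will have size $\alpha$; that is, $c$ is selected so that
$P[R \ge c \mid \mathcal{H}_0] = \alpha$. Note that $\bar{X}_{j\cdot}$ is independent of $\sum_i (X_{ji}-\bar{X}_{j\cdot})^2$ and, hence, the
numerator of Eq. (21) is independent of the denominator. Also, under $\mathcal{H}_0$,
note that the numerator sum of squares $\sum_j n_j(\bar{X}_{j\cdot}-\bar{X})^2$ divided by $\sigma^2$ has a chi-square
distribution with $k - 1$ degrees of freedom, and the denominator sum of squares
$\sum_j\sum_i (X_{ji}-\bar{X}_{j\cdot})^2$ divided by $\sigma^2$ has a chi-square distribution with $n - k$ degrees
of freedom. Consequently, if $\mathcal{H}_0$ is true, $R$ has an
$F$ distribution with $k - 1$ and $n - k$ degrees of freedom; so the constant $c$ is the
$(1-\alpha)$th quantile of the $F$ distribution with $k - 1$ and $n - k$ degrees of freedom.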

The testing problem considered above is often referred to as a one-way
analysis of variance. In some experimental situations, an experimenter is
interested in determining whether or not various possible treatments affect the
yield. For example, one might be interested in finding out whether various
types of fertilizer applications affect the yield of a certain crop. The different
treatments correspond to the different populations, and when we test that there
is no population difference, we are testing that there is no "treatment" effect.
The term "analysis of variance" is explained if we note that the denominator of
the ratio in Eq. (21) is an estimate of the variation within populations and the
numerator is an estimate of the variation between populations when means are
equal. We are analyzing variance to test equality of means.
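The decomposition of the total sum of squares into a within-population part and a between-population part, and the resulting variance ratio of Eq. (21), can be sketched as follows. The yields below are hypothetical numbers chosen so that the arithmetic is easy to follow, and the function name `one_way_anova` is ours.

```python
def one_way_anova(samples):
    """Return (r, lam, between, within) for k samples, one list per population.

    r is the variance ratio of Eq. (21); lam is the generalized
    likelihood-ratio written in terms of the two sums of squares.
    """
    k = len(samples)
    n = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n
    means = [sum(s) / len(s) for s in samples]
    # Between-population sum of squares: sum_j n_j (xbar_j. - xbar)^2
    between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    # Within-population sum of squares: sum_j sum_i (x_ji - xbar_j.)^2
    within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    r = (between / (k - 1)) / (within / (n - k))
    lam = (1 + between / within) ** (-n / 2)
    return r, lam, between, within

# Hypothetical yields under three fertilizer treatments.
yields = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
r, lam, between, within = one_way_anova(yields)
print(r, lam)
```

For these numbers the between and within sums of squares are 26 and 6, so that $r = (26/2)/(6/6) = 13$, which would then be compared with the $(1-\alpha)$th quantile of the $F$ distribution with 2 and 6 degrees of freedom.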


4.4 Tests on Several Variances


Two variances Given random samples from each of two normal populations
with means and variances $(\mu_1, \sigma_1^2)$ and $(\mu_2, \sigma_2^2)$, we may test hypotheses about
the two variances. We will consider testing

(i) $\mathcal{H}_0$: $\sigma_1^2 \le \sigma_2^2$ versus $\mathcal{H}_1$: $\sigma_1^2 > \sigma_2^2$

(ii) $\mathcal{H}_0$: $\sigma_1^2 \ge \sigma_2^2$ versus $\mathcal{H}_1$: $\sigma_1^2 < \sigma_2^2$

(iii) $\mathcal{H}_0$: $\sigma_1^2 = \sigma_2^2$ versus $\mathcal{H}_1$: $\sigma_1^2 \ne \sigma_2^2$

If $X_{11}, \ldots, X_{1n_1}$ is a random sample from a normal density with mean $\mu_1$
and variance $\sigma_1^2$, if $X_{21}, \ldots, X_{2n_2}$ is a random sample from a normal density
with mean $\mu_2$ and variance $\sigma_2^2$, and if the two samples are independent, then we
know that

$$\frac{\sum (X_{1i}-\bar{X}_1)^2/[(n_1-1)\sigma_1^2]}{\sum (X_{2j}-\bar{X}_2)^2/[(n_2-1)\sigma_2^2]}$$

has the $F$ distribution with $n_1 - 1$ and $n_2 - 1$ degrees of freedom and, in partic-
ular, the statistic

$$R = \frac{(n_2-1)\sum (X_{1i}-\bar{X}_1)^2}{(n_1-1)\sum (X_{2j}-\bar{X}_2)^2} \tag{22}$$

has the $F$ distribution with $n_1 - 1$ and $n_2 - 1$ degrees of freedom when $\sigma_1^2 = \sigma_2^2$.
Note that the statistic $R$ tends to be large when $\sigma_1^2 > \sigma_2^2$ and small when $\sigma_1^2 < \sigma_2^2$,
and so we can capitalize on this different behavior to formulate tests for the
hypotheses (i) to (iii). For instance, in testing $\mathcal{H}_0$: $\sigma_1^2 \le \sigma_2^2$ versus $\mathcal{H}_1$: $\sigma_1^2 > \sigma_2^2$,
we would reject $\mathcal{H}_0$ for large $R$, or a size-$\alpha$ test is given by the following: Reject
$\mathcal{H}_0$ if and only if $R$ exceeds $F_{1-\alpha}(n_1-1, n_2-1)$, the $(1-\alpha)$th quantile of the
$F$ distribution with $n_1 - 1$ and $n_2 - 1$ degrees of freedom. Similarly, a test of
$\mathcal{H}_0$: $\sigma_1^2 \ge \sigma_2^2$ versus $\mathcal{H}_1$: $\sigma_1^2 < \sigma_2^2$ is given by the following: Reject $\mathcal{H}_0$ if and
only if $R$ is less than $F_{\alpha}(n_1-1, n_2-1)$, the $\alpha$th quantile of the $F$ distribution
with $n_1 - 1$ and $n_2 - 1$ degrees of freedom. A test of $\mathcal{H}_0$: $\sigma_1^2 = \sigma_2^2$ versus
$\mathcal{H}_1$: $\sigma_1^2 \ne \sigma_2^2$ should be two-tailed; that is, $\mathcal{H}_0$ should be rejected for small or
large $R$. In other words, a test is given by the following: Accept $\mathcal{H}_0$ if and only
if $k_1 < R < k_2$, where $k_1$ and $k_2$ are selected so that the test will have size $\alpha$.
It is customary to make the two tails have equal areas of $\alpha/2$ (although this
is not quite the best test); then $k_1 = F_{\alpha/2}(n_1-1, n_2-1)$, and $k_2 =
F_{1-\alpha/2}(n_1-1, n_2-1)$.

We might mention that the above-defined tests can all be derived using the
generalized likelihood-ratio principle.


Equality of several variances Let $X_{j1}, \ldots, X_{jn_j}$ be a random sample of size $n_j$
from a normal population with mean $\mu_j$ and variance $\sigma_j^2$, $j = 1, \ldots, k$. Assume
that the $k$ samples are independent. Our object is to test the null hypothesis
$\mathcal{H}_0$: $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$ against the alternative that not all variances are equal.
The likelihood function is

$$L(\mu_1, \ldots, \mu_k, \sigma_1^2, \ldots, \sigma_k^2; x_{11}, \ldots, x_{1n_1}, \ldots, x_{k1}, \ldots, x_{kn_k}) = \prod_{j=1}^{k}\prod_{i=1}^{n_j} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left[-\frac{1}{2}\left(\frac{x_{ji}-\mu_j}{\sigma_j}\right)^2\right],$$

and the maximum-likelihood estimates of $\mu_j$, $\sigma_j^2$, $j = 1, \ldots, k$, are given by

$$\hat{\mu}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} x_{ji} = \bar{x}_{j\cdot}$$

and

$$\hat{\sigma}_j^2 = \frac{1}{n_j}\sum_{i=1}^{n_j}(x_{ji}-\bar{x}_{j\cdot})^2.$$

The null hypothesis states that all $\sigma_j^2$ are equal. Let $\sigma^2$ denote their common
value; then $\Theta_0 = \{(\mu_1, \ldots, \mu_k, \sigma^2): -\infty < \mu_j < \infty;\ \sigma^2 > 0\}$, and the maximum-
likelihood estimates of $\mu_1, \ldots, \mu_k, \sigma^2$ over $\Theta_0$ are given by

$$\hat{\mu}_j = \bar{x}_{j\cdot}, \quad j = 1, \ldots, k,$$

and

$$\hat{\sigma}^2 = \frac{1}{\sum n_j}\sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ji}-\bar{x}_{j\cdot})^2 = \frac{\sum n_j\hat{\sigma}_j^2}{\sum n_j}.$$

Therefore,

$$\lambda = \frac{\sup_{\Theta_0} L}{\sup_{\Theta} L} = \frac{(2\pi\hat{\sigma}^2)^{-\sum n_j/2}\exp(-\sum n_j/2)}{\prod_{j=1}^{k}(2\pi\hat{\sigma}_j^2)^{-n_j/2}\exp(-\sum n_j/2)} = \frac{\prod_{j=1}^{k}(\hat{\sigma}_j^2)^{n_j/2}}{\left(\sum n_j\hat{\sigma}_j^2/\sum n_j\right)^{\sum n_j/2}}.$$

A generalized likelihood-ratio test is given by the following: Reject $\mathcal{H}_0$ if and
only if $\lambda \le \lambda_0$. We would like to determine the size of the test for any constant
$\lambda_0$ or find $\lambda_0$ so that the test has size $\alpha$, but, unfortunately, the distribution of the
generalized likelihood-ratio is intractable. An approximate size-$\alpha$ test can be
obtained for large $n_j$ since it can be proved that $-2 \log \Lambda$ is approximately
distributed as a chi-square distribution with $k - 1$ degrees of freedom. Accord-
ing to the generalized likelihood-ratio principle $\mathcal{H}_0$ is to be rejected for small
$\lambda$; hence $\mathcal{H}_0$ should be rejected here for large $-2 \log \lambda$; that is, the critical region
of the approximate test should be the right tail. So the approximate size-$\alpha$ test
is the following: Reject $\mathcal{H}_0$ if and only if $-2 \log \lambda > \chi^2_{1-\alpha}(k-1)$, the $(1-\alpha)$th
quantile of the chi-square distribution with $k - 1$ degrees of freedom. (Several
other approximations to the distribution of the likelihood-ratio statistic have
been given, and some exact tests are also available.)
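The approximate test just described reduces to a short calculation, since $-2 \log \lambda = (\sum n_j)\log\hat{\sigma}^2 - \sum n_j \log\hat{\sigma}_j^2$. The sketch below evaluates this quantity for three hypothetical samples; the function name is ours, and the samples are far too small for the chi-square approximation to be trusted, so the comparison is illustrative only. For $k = 3$ the reference distribution has two degrees of freedom, in which case the $(1-\alpha)$th quantile is exactly $-2\log\alpha$ and no table is needed.

```python
import math

def neg2_log_lambda_variances(samples):
    """Return -2 log(lambda) for testing equality of k normal variances.

    Uses the maximum-likelihood (divide-by-n_j) variance estimates.
    Each sample must have positive sample variance, otherwise log fails.
    """
    n = sum(len(s) for s in samples)
    var_hats = []
    for s in samples:
        m = sum(s) / len(s)
        var_hats.append(sum((x - m) ** 2 for x in s) / len(s))
    # Pooled estimate under the null hypothesis: sum n_j var_j / sum n_j
    pooled = sum(len(s) * v for s, v in zip(samples, var_hats)) / n
    return n * math.log(pooled) - sum(
        len(s) * math.log(v) for s, v in zip(samples, var_hats)
    )

# Hypothetical data for k = 3 populations.
samples = [[9.8, 10.1, 10.4, 9.9], [8.0, 12.5, 10.2, 9.1], [10.0, 10.3, 9.6, 10.2]]
stat = neg2_log_lambda_variances(samples)
alpha = 0.05
crit = -2 * math.log(alpha)   # exact chi-square quantile when k - 1 = 2
print(stat, crit, stat > crit)
```

The statistic is zero when all the $\hat{\sigma}_j^2$ coincide and grows as they spread apart, which is consistent with placing the critical region in the right tail.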


5 CHI-SQUARE TESTS 

In this section we present a number of tests of hypotheses that one way or another
involve the chi-square distribution. Included will be the asymptotic distribution
of the generalized likelihood-ratio, goodness-of-fit tests, and tests concerning
contingency tables. The material in this section will be presented with an aim
of merely finding tests of certain hypotheses, and it will not be presented in such
a way that concern is given to the optimality of the test. Thus, the power
functions of the derived tests will not be discussed.


5.1 Asymptotic Distribution of Generalized Likelihood-ratio

On two occasions in Sec. 4 we found that the distribution of the generalized
likelihood-ratio was intractable, and both times we indicated that an approximate
test could be obtained by using an asymptotic distribution of the generalized
likelihood-ratio. The following theorem, which we shall not be able to prove
because of the advanced character of its proof, gives the asymptotic distribution
of the generalized likelihood-ratio.

Theorem 7 Let $X_1, \ldots, X_n$ be a sample with joint density $f_{X_1, \ldots, X_n}(\cdot, \ldots, \cdot; \theta)$,
where $\theta = (\theta_1, \ldots, \theta_k)$, that is assumed to satisfy quite general
regularity conditions. Suppose that the parameter space $\Theta$ is $k$-dimen-
sional. In testing the hypothesis

$$\mathcal{H}_0: \theta_1 = \theta_1^0, \ldots, \theta_r = \theta_r^0, \theta_{r+1}, \ldots, \theta_k,$$

where $\theta_1^0, \ldots, \theta_r^0$ are known and $\theta_{r+1}, \ldots, \theta_k$ are left unspecified, $-2 \log \Lambda_n$
is approximately distributed as a chi-square distribution with $r$ degrees of
freedom when $\mathcal{H}_0$ is true and the sample size $n$ is large. ////

We have assumed that $1 \le r \le k$ in the above theorem. If $r = k$, then all
parameters are specified and none is left unspecified. The parameter space $\Theta$ is
$k$-dimensional, and since $\mathcal{H}_0$ specifies the value of $r$ of the components of
$(\theta_1, \ldots, \theta_k)$, the dimension of $\Theta_0$ is $k - r$. Thus, the degrees of freedom of the
asymptotic chi-square distribution in Theorem 7 can be thought of in two ways:
first, as the number of parameters specified by $\mathcal{H}_0$ and, second, as the difference
in the dimensions of $\Theta$ and $\Theta_0$.

Recall that $\Lambda_n$ is the random variable which has values

$$\lambda_n = \frac{\sup_{\Theta_0} L(\theta_1, \ldots, \theta_k; x_1, \ldots, x_n)}{\sup_{\Theta} L(\theta_1, \ldots, \theta_k; x_1, \ldots, x_n)},$$

which in turn is the generalized likelihood-ratio for a sample of size $n$. $\Theta_0$ is
that subset of $\Theta$ that is specified by $\mathcal{H}_0$. The generalized likelihood-ratio prin-
ciple dictates that $\mathcal{H}_0$ is to be rejected for $\lambda_n$ small, but since $-2 \log \lambda_n$ increases
as $\lambda_n$ decreases, a test that is equivalent to a generalized likelihood-ratio test is
one that rejects for $-2 \log \lambda_n$ large. Now, since the theorem gives an approx-
imate distribution for the values $-2 \log \lambda_n$ when $\mathcal{H}_0$ is true, a test with approx-
imate size $\alpha$ is given by the following:

Reject $\mathcal{H}_0$ if and only if $-2 \log \lambda_n > \chi^2_{1-\alpha}(r)$,

where $\chi^2_{1-\alpha}(r)$ is the $(1-\alpha)$th quantile of the chi-square distribution with $r$
degrees of freedom. Note that the degrees of freedom $r$ is the number of compo-
nents of the parameter space that are specified by the null hypothesis.

Because of the specific form of the null hypothesis in the theorem, it may
appear that the result is not too widely applicable. The null hypothesis of the
theorem specifies the values of a subset of the $k$ components of the $k$-dimensional
parameter space, and not many null hypotheses are of that form. However,
often the density can be reparameterized so that the null hypothesis is of the
form given in the theorem. We illustrate with two examples.


EXAMPLE 19 Recall that in Subsec. 4.3 we discussed testing $\mathcal{H}_0$: $\mu_1 = \mu_2$,
$\sigma_1^2 > 0$, $\sigma_2^2 > 0$ versus $\mathcal{H}_1$: $\mu_1 \ne \mu_2$, $\sigma_1^2 > 0$, $\sigma_2^2 > 0$, where $\mu_1$ and $\sigma_1^2$ are
the mean and variance of one normal population and $\mu_2$ and $\sigma_2^2$ are
the mean and variance of another. Here the parameter space is four-
dimensional, and although $\mathcal{H}_0$ does not appear to be of the form given
in Theorem 7, we can reparameterize to make it of that form. Let
$\theta_1 = \mu_1 - \mu_2$, $\theta_2 = \mu_2$, $\theta_3 = \sigma_1^2$, and $\theta_4 = \sigma_2^2$. In terms of the reparam-
eterization $\mathcal{H}_0$ becomes $\mathcal{H}_0$: $\theta_1 = 0$, $\theta_2$, $\theta_3$, $\theta_4$; that is, the com-
ponent $\theta_1$ is specified to be 0, and the remaining three components are
unspecified. The theorem is now applicable for the reparameterization;
that is, the asymptotic distribution of $-2 \log \Lambda'$ is known (and is the chi-
square distribution with one degree of freedom) for $\mathcal{H}_0$ true, where $\Lambda'$ is
the generalized likelihood-ratio obtained under the reparameterization.
However, because of the invariance property of maximum-likelihood
estimators, the generalized likelihood-ratio $\Lambda'$ obtained under the reparam-
eterization is the same as the generalized likelihood-ratio $\Lambda$ obtained
before reparameterization. ////


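EXAMPLE 20 In Subsec. 4.4 we tested $\mathcal{H}_0$: $\sigma_1^2 = \cdots = \sigma_m^2$, $\mu_1, \ldots, \mu_m$, where
$\mu_j$ and $\sigma_j^2$ were, respectively, the mean and variance of the $j$th normal
population, $j = 1, \ldots, m$. (In Subsec. 4.4, $k$ was used instead of $m$.) If
we make the following reparameterization, $\mathcal{H}_0$ will have the desired form
of Theorem 7:

$$\theta_1 = \frac{\sigma_1^2}{\sigma_m^2}, \ \ldots, \ \theta_{m-1} = \frac{\sigma_{m-1}^2}{\sigma_m^2}, \ \theta_m = \sigma_m^2, \ \theta_{m+1} = \mu_1, \ \ldots, \ \theta_{2m} = \mu_m.$$

Now $\mathcal{H}_0$ becomes $\mathcal{H}_0$: $\theta_1 = 1, \ldots, \theta_{m-1} = 1$, $\theta_m$, $\theta_{m+1}, \ldots, \theta_{2m}$; that is,
the first $m - 1$ components are specified to be 1 and the remaining are
unspecified. Theorem 7 is now applicable, and, again, because of the
invariance property of maximum-likelihood estimates the generalized
likelihood-ratios obtained before and after reparameterization are the
same; hence the asymptotic distribution of $-2 \log \Lambda$, as claimed in
Subsec. 4.4, is the chi-square distribution with $m - 1$ degrees of freedom
when $\mathcal{H}_0$ is true. ////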


5.2 Chi-square Goodness-of-fit Test 

We commence this section with an example of a testing problem that involves
the specification of the parameters of a multinomial distribution. It is hoped
that this example will help motivate the presentation of the goodness-of-fit test.
If a population has a multinomial density

$$f(x_1, \ldots, x_k; p_1, \ldots, p_k) = \prod_{j=1}^{k+1} p_j^{x_j}, \tag{23}$$

where $x_j = 0$ or 1, $j = 1, \ldots, k+1$; $0 \le p_j \le 1$, $j = 1, \ldots, k+1$; $\sum_{j=1}^{k+1} x_j = 1$;
and $\sum_{j=1}^{k+1} p_j = 1$ (as would be the case in sampling with replacement from a
population of individuals who could be classified into $k + 1$ classes or categories),
a common problem is that of testing whether the probabilities $p_j$ have specified
numerical values. Thus, for instance, the result of casting a die may be classified
into one of six classes, and on the basis of a sample of observations we may wish
to test whether the die is true, that is, whether $p_j = \frac{1}{6}$ for $j = 1, \ldots, 6$. One can
also think in terms of independent, repeated trials, where each trial can result in
any one of $k + 1$ outcomes, called classes or categories. The density in Eq. (23)
then gives the density for the outcome of one trial. The result of one trial can
be represented by the multivariate random variable $(X_1, \ldots, X_k)$, where $X_j$ is
unity if the trial results in category $j$ and is 0 otherwise; $p_j$ is the probability
that a trial results in category $j$. Now if we independently repeat the trial $n$
times, we have $n$ observations of the multivariate random variable $(X_1, \ldots, X_k)$;
we can display them as

$$(X_{11}, \ldots, X_{1k}), (X_{21}, \ldots, X_{2k}), \ldots, (X_{n1}, \ldots, X_{nk}).$$

If we let $N_j = \sum_{i=1}^{n} X_{ij}$, then the random variable $N_j$ is the number of the $n$ trials
resulting in category $j$. We know that $(N_1, \ldots, N_k)$ has a multinomial distribu-
tion. (See Example 5 in Subsec. 2.2 of Chap. IV.)

To test the null hypothesis $\mathcal{H}_0$: $p_j = p_j^0$, $j = 1, \ldots, k+1$, where $p_j^0$ are
given probabilities summing to unity, we hope to employ the generalized
likelihood-ratio principle. The likelihood function is given by

$$L = L(p_1, \ldots, p_k; x_{11}, \ldots, x_{1k}, \ldots, x_{n1}, \ldots, x_{nk}) = \prod_{i=1}^{n}\prod_{j=1}^{k+1} p_j^{x_{ij}} = \prod_{j=1}^{k+1} p_j^{n_j}. \tag{24}$$

The parameter space $\Theta$ has $k$ dimensions (given $k$ of the $k + 1$ $p_j$'s, the remaining
one is determined by $\sum p_j = 1$), while $\Theta_0$ is a point. It is readily found that $L$ is
maximized in $\Theta$ when

$$\hat{p}_j = \frac{n_j}{n}, \quad j = 1, \ldots, k+1,$$

where $n_j$ is a value of the random variable $N_j$. Hence,

$$\sup_{\Theta} L = \frac{1}{n^n}\prod_{j=1}^{k+1} n_j^{n_j}.$$

The maximum of $L$ over $\Theta_0$ is its only value $\prod_{j=1}^{k+1} (p_j^0)^{n_j}$, and so the generalized
likelihood-ratio is

$$\lambda = n^n \prod_{j=1}^{k+1}\left(\frac{p_j^0}{n_j}\right)^{n_j}.$$

A generalized likelihood-ratio test is given by the following: Reject $\mathcal{H}_0$ if and
only if $\lambda \le \lambda_0$, where the constant $\lambda_0$ is chosen to give the desired probability of
a Type I error. For small $n$ the distribution of the generalized likelihood-ratio
may be tabulated directly in order to determine $\lambda_0$; for large values of $n$, we may
use Theorem 7, which states that $-2 \log \Lambda$ has approximately the chi-square
distribution with $k$ degrees of freedom. The chi-square approximation is sur-
prisingly good even if $n$ is small provided that $k > 2$.

Another test which is still commonly used for testing $\mathcal{H}_0$ was proposed (by
Karl Pearson) before the general theory of testing hypotheses was developed.
This test uses the statistic

$$Q_k^0 = \sum_{j=1}^{k+1}\frac{(N_j - np_j^0)^2}{np_j^0}, \tag{25}$$

which tends to be small when $\mathcal{H}_0$ is true and large when $\mathcal{H}_0$ is false. Note that
$N_j$ is the observed number of trial outcomes resulting in category $j$ and $np_j^0$ is the
expected number when $\mathcal{H}_0$ is true. It can be easily shown (see Prob. 39)
that

$$\mathcal{E}[Q_k^0] = \sum_{j=1}^{k+1}\frac{1}{np_j^0}\left[np_j(1-p_j) + n^2(p_j - p_j^0)^2\right], \tag{26}$$

where the $p_j$ are the true parameters. If $\mathcal{H}_0$ is true, then $\mathcal{E}[Q_k^0] = \sum (1 - p_j^0) =
k + 1 - 1 = k$. The following theorem gives a limiting distribution for $Q_k^0$ when
the null hypothesis $\mathcal{H}_0$ is true.

Theorem 8 Let the possible outcomes of a certain random experiment
be decomposed into $k + 1$ mutually exclusive sets, say $A_1, \ldots, A_{k+1}$.
Define $p_j = P[A_j]$, $j = 1, \ldots, k+1$. In $n$ independent repetitions of the
random experiment, let $N_j$ denote the number of outcomes belonging to
set $A_j$, $j = 1, \ldots, k+1$, so that $\sum_{j=1}^{k+1} N_j = n$. Then

$$Q_k = \sum_{j=1}^{k+1}\frac{(N_j - np_j)^2}{np_j} \tag{27}$$

has as a limiting distribution, as $n$ approaches infinity, the chi-square
distribution with $k$ degrees of freedom. ////

We will not prove the above theorem, but we will indicate its proof for
$k = 1$. What needs to be demonstrated is that for each argument $x$, $F_{Q_k}(x)$
converges to $F_{\chi^2(k)}(x)$ as $n \to \infty$, where $F_{Q_k}(x)$ is the cumulative distribution
function of the random quantity $Q_k$ and $F_{\chi^2(k)}(x)$ is the cumulative distribution
function of a chi-square random variable having $k$ degrees of freedom. (Note
that $k + 1$, the number of groups, is held fixed, and $n$, the sample size, is increas-
ing.) If $k = 1$, then

$$Q_k = Q_1 = \frac{(N_1 - np_1)^2}{np_1} + \frac{(N_2 - np_2)^2}{np_2} = \frac{(N_1 - np_1)^2}{np_1} + \frac{(n - N_1 - n + np_1)^2}{n(1-p_1)} = \frac{(N_1 - np_1)^2}{np_1(1-p_1)}.$$

We know that $N_1$ has a binomial distribution with parameters $n$ and $p_1$ and that
$Y_n = (N_1 - np_1)/\sqrt{np_1(1-p_1)}$ has a limiting standard normal distribution;
hence, since the square of a standard normal random variable has a chi-square
distribution with one degree of freedom, we suspect that $Y_n^2 = Q_1$ has a limiting
chi-square distribution with one degree of freedom, and such can be easily shown
to be the case, which would give a proof of Theorem 8 for $k = 1$.
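The identity $Q_1 = Y_n^2$ used in this argument is purely algebraic and can be confirmed numerically. The short sketch below evaluates both sides for one arbitrary choice of $n$, $p_1$, and observed count; the function names and the numbers are ours and have no significance beyond illustration.

```python
import math

def q1(count1, n, p1):
    """Pearson statistic for two categories (k = 1), written as a two-term sum."""
    count2, p2 = n - count1, 1 - p1
    return (count1 - n * p1) ** 2 / (n * p1) + (count2 - n * p2) ** 2 / (n * p2)

def y_n(count1, n, p1):
    """Standardized binomial count (N_1 - n p_1) / sqrt(n p_1 (1 - p_1))."""
    return (count1 - n * p1) / math.sqrt(n * p1 * (1 - p1))

# Arbitrary illustrative values: 20 outcomes in category 1 out of n = 50 trials.
n, p1, count1 = 50, 0.3, 20
print(q1(count1, n, p1), y_n(count1, n, p1) ** 2)
```

The two printed values agree up to rounding error, whatever count is supplied, because $1/p_1 + 1/(1-p_1) = 1/[p_1(1-p_1)]$.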

Theorem 8 gives the limiting distribution for the statistic

$$Q_k^0 = \sum_{j=1}^{k+1}\frac{(N_j - np_j^0)^2}{np_j^0}$$

when the null hypothesis $\mathcal{H}_0$: $p_j = p_j^0$, $j = 1, \ldots, k+1$, is true. Thus a test of
$\mathcal{H}_0$: $p_j = p_j^0$, $j = 1, \ldots, k+1$, which has approximate size $\alpha$, is given by the
following:

Reject $\mathcal{H}_0$ if and only if $Q_k^0 > \chi^2_{1-\alpha}(k)$,

where $\chi^2_{1-\alpha}(k)$ is the $(1-\alpha)$th quantile of the chi-square distribution with $k$
degrees of freedom.
We now have two large-sample tests of the null hypothesis $\mathcal{H}_0$: $p_j = p_j^0$,
$j = 1, \ldots, k+1$: the one just defined, which uses Theorem 8, and the other given
in terms of the generalized likelihood-ratio, which uses Theorem 7. It can, in
fact, be shown that the two tests are equivalent for large samples.


EXAMPLE 21 Mendelian theory indicates that the shape and color of a
certain variety of pea ought to be grouped into four groups, "round and
yellow," "round and green," "angular and yellow," and "angular and
green," according to the ratios 9/3/3/1. For $n = 556$ peas, the following
were observed (the last column gives the expected number):

                        Observed    Expected
Round and yellow          315        312.75
Round and green           108        104.25
Angular and yellow        101        104.25
Angular and green          32         34.75

A size-.05 test of the null hypothesis $\mathcal{H}_0$: $p_1 = \frac{9}{16}$, $p_2 = \frac{3}{16}$, $p_3 = \frac{3}{16}$,
and $p_4 = \frac{1}{16}$ is given by the following:

Reject $\mathcal{H}_0$ if and only if $Q_3^0 = \sum_{j=1}^{4}\frac{(n_j - np_j^0)^2}{np_j^0}$ exceeds $\chi^2_{1-\alpha}(k) = \chi^2_{.95}(3) = 7.81$.

The observed $Q_3^0$ is

$$\frac{(315 - 312.75)^2}{312.75} + \frac{(108 - 104.25)^2}{104.25} + \frac{(101 - 104.25)^2}{104.25} + \frac{(32 - 34.75)^2}{34.75} \approx .470,$$

and so there is good agreement with the null hypothesis; that is, there is a
good fit of the data to the model. ////
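The arithmetic of this example is easily reproduced. The sketch below uses the observed counts 315, 108, 101, 32 (which sum to $n = 556$) and the hypothesized probabilities 9/16, 3/16, 3/16, 1/16, and computes both the Pearson statistic of Eq. (25) and, for comparison, $-2 \log \lambda = 2\sum n_j \log[n_j/(np_j^0)]$ from the generalized likelihood-ratio. The variable names are ours; the critical value 7.81 is the one quoted in the example.

```python
import math

observed = [315, 108, 101, 32]              # counts for the four pea types
null_probs = [9 / 16, 3 / 16, 3 / 16, 1 / 16]

n = sum(observed)
expected = [n * p for p in null_probs]      # 312.75, 104.25, 104.25, 34.75

# Pearson statistic, Eq. (25)
q = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# -2 log(lambda) from the generalized likelihood-ratio for the multinomial
neg2_log_lambda = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))

critical_value = 7.81                       # .95 quantile, 3 degrees of freedom
print(round(q, 3), round(neg2_log_lambda, 3), q > critical_value)
```

Both statistics fall far below the critical value, and they are close to one another, in keeping with the remark that the two tests are equivalent for large samples.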


Theorem 8 can be generalized to the case where the probabilities $p_j$ may
depend on unknown parameters. The generalization is given in the next
theorem.

Theorem 9 Let the possible outcomes of a certain random experiment
be decomposed into $k + 1$ mutually exclusive sets, say $A_1, \ldots, A_{k+1}$.
Define $p_j = P[A_j]$, $j = 1, \ldots, k+1$, and assume that $p_j$ depends on $r$
unknown parameters $\theta_1, \ldots, \theta_r$, so that $p_j = \pi_j(\theta_1, \ldots, \theta_r)$, $j = 1, \ldots,
k+1$. In $n$ independent repetitions of the random experiment, let $N_j$
denote the number of outcomes belonging to set $A_j$, $j = 1, \ldots, k+1$, so
that $\sum_{j=1}^{k+1} N_j = n$. Let $\hat{\Theta}_1, \ldots, \hat{\Theta}_r$ be BAN estimators (e.g., maximum-
likelihood estimators) of $\theta_1, \ldots, \theta_r$ based on $N_1, \ldots, N_k$. Then, under
certain general regularity conditions on the $\pi_j$'s,

$$Q'_k = \sum_{j=1}^{k+1}\frac{(N_j - n\hat{P}_j)^2}{n\hat{P}_j}$$

has a limiting distribution that is the chi-square distribution with $k - r$
degrees of freedom, where $\hat{P}_j = \pi_j(\hat{\Theta}_1, \ldots, \hat{\Theta}_r)$, $j = 1, \ldots, k+1$. ////

The proof of Theorem 9 is beyond the scope of this book. The limiting
distribution given in Theorem 9 differs from the limiting distribution given in
Theorem 8 only in the number of degrees of freedom. In Theorem 8 there are $k$
degrees of freedom, and in Theorem 9 there are $k - r$ degrees of freedom; the
number of degrees of freedom has been reduced by one for each parameter that
is estimated from the data.

No mention of hypothesis testing is made in the statement of Theorem 9.
However, we will show now how the results of the theorem can be used to obtain
a goodness-of-fit test. Suppose that it is desired to test that a random sample
$X_1, \ldots, X_n$ came from a density $f(x; \theta_1, \ldots, \theta_r)$, where $\theta_1, \ldots, \theta_r$ are unknown
parameters but the function $f$ is known. The null hypothesis is the composite
hypothesis $\mathcal{H}_0$: $X_i$ has density $f(x; \theta_1, \ldots, \theta_r)$ for some $\theta_1, \ldots, \theta_r$. The null
hypothesis states that the random sample came from the parametric family of
densities that is specified by $f(\cdot; \theta_1, \ldots, \theta_r)$. If the range of the random variable
$X_i$ is decomposed into $k + 1$ subsets, say $A_1, \ldots, A_{k+1}$, if $p_j = P[X_i \in A_j]$, and if
$N_j$ = number of $X_i$'s falling in $A_j$, then, according to Theorem 9,

$$Q'_k = \sum_{j=1}^{k+1}\frac{(N_j - n\hat{P}_j)^2}{n\hat{P}_j}$$

is approximately distributed as the chi-square distribution with $k - r$ degrees of
freedom if $n$ is large and $\mathcal{H}_0$ is true, where $\hat{P}_j = \pi_j(\hat{\Theta}_1, \ldots, \hat{\Theta}_r)$ and $\hat{\Theta}_i$ is a
maximum-likelihood estimator of $\theta_i$, $i = 1, \ldots, r$, obtained from the statistics
$N_1, \ldots, N_k$. {Note that $\pi_j(\theta_1, \ldots, \theta_r) = P[X_i \in A_j]$, which for a continuous
random variable $X_i$ equals $\int_{A_j} f(x; \theta_1, \ldots, \theta_r)\,dx$.} Hence, a test of $\mathcal{H}_0$ can be
obtained by rejecting $\mathcal{H}_0$ if and only if the statistic $Q'_k$ is large; that is, reject
$\mathcal{H}_0$ if and only if $Q'_k$ exceeds $\chi^2_{1-\alpha}(k-r)$, where $\chi^2_{1-\alpha}(k-r)$ is the $(1-\alpha)$th
quantile of the chi-square distribution with $k - r$ degrees of freedom. Such a
test is called a goodness-of-fit test since it tests whether or not the observations
$x_1, \ldots, x_n$ fit, or are consistent with, the assumption that they are observations
from the density $f(x; \theta_1, \ldots, \theta_r)$.

In the above, the $\theta_i$, for $i = 1, \ldots, r$, were estimated by using the statistics
$N_1, \ldots, N_k$ rather than $X_1, \ldots, X_n$. The statistics $N_1, \ldots, N_k$ give the number
of $x$ observations falling in each of the $A_j$ subsets or groups. In practice, often
the values of the $X_i$'s are not recorded, and then the group totals $N_1, \ldots, N_k$
constitute the available information. If, however, the observations $X_1, \ldots, X_n$
were available, then one could estimate $\theta_i$, $i = 1, \ldots, r$, more efficiently by using,
say, maximum-likelihood estimators based on $X_1, \ldots, X_n$. When such esti-
mators are used, the limiting distribution of $Q'_k$ is no longer a chi-square distribu-
tion with $k - r$ degrees of freedom; instead, the limiting distribution of $Q'_k$ is
bounded between a chi-square distribution with $k - r$ degrees of freedom and a
chi-square distribution with $k$ degrees of freedom. In a sense, some of the
"lost" $r$ degrees of freedom are recouped by efficiently estimating $\theta_1, \ldots, \theta_r$.
For a proof of Theorem 9 and further discussion of the above the reader is
referred to Kendall and Stuart [14].

EXAMPLE 22 Suppose it is desired to test the hypothesis that an observed
random sample has been drawn from some normal population.

Let the $n$ sample values $x_1, \ldots, x_n$ be grouped into $k + 1$ classes. For
example, the $j$th class could be taken as all those observations falling in the
interval $(z_{j-1}, z_j]$, $j = 1, \ldots, k+1$, for some $z_0 < z_1 < z_2 < \cdots < z_k
< z_{k+1}$, where $z_0 = -\infty$ and $z_{k+1} = +\infty$. Then

$$p_j = \pi_j(\mu, \sigma) = \int_{z_{j-1}}^{z_j} \phi_{\mu,\sigma^2}(x)\,dx = \Phi\left(\frac{z_j - \mu}{\sigma}\right) - \Phi\left(\frac{z_{j-1} - \mu}{\sigma}\right).$$

Let $\hat{\mu}$ and $\hat{\sigma}$ be the maximum-likelihood estimates of $\mu$ and $\sigma$ based on
$n_1, \ldots, n_k$, where $n_j$ is the number of observations falling in the $j$th interval.
Then,

$$\hat{P}_j = \pi_j(\hat{\mu}, \hat{\sigma})$$

can be determined from the sample, and so the value

$$q'_k = \sum_{j=1}^{k+1}\frac{(n_j - n\hat{P}_j)^2}{n\hat{P}_j}$$

of $Q'_k$ can also be obtained from the sample. The hypothesis that the
sample came from a normal population would be rejected at the $\alpha$ level if
$q'_k > \chi^2_{1-\alpha}(k-2)$. If, on the other hand, $\hat{\mu}$ and $\hat{\sigma}$ were obtained from
maximum-likelihood estimators based on $X_1, \ldots, X_n$, then the asymptotic
distribution of $Q'_k$ would fall between a chi-square distribution with $k - 2$
degrees of freedom and a chi-square distribution with $k$ degrees of freedom.
The hypothesis would be rejected if $q'_k > C$, where $C$ falls between
$\chi^2_{1-\alpha}(k-2)$ and $\chi^2_{1-\alpha}(k)$. Note that for $k$ large there is little difference
between $\chi^2_{1-\alpha}(k-2)$ and $\chi^2_{1-\alpha}(k)$. ////
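The computation described in this example can be sketched as follows. The code evaluates the class probabilities $\Phi[(z_j-\mu)/\sigma] - \Phi[(z_{j-1}-\mu)/\sigma]$ with the standard library error function and then forms the Pearson statistic. It is only a sketch: the cut points, the class counts, and the values plugged in for the estimates of $\mu$ and $\sigma$ are hypothetical, and the step of actually maximizing the grouped-data likelihood to obtain those estimates is not shown.

```python
import math

def std_normal_cdf(z):
    """Standard normal cumulative distribution function via math.erf."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def cell_probs(cuts, mu, sigma):
    """Probabilities of the k + 1 classes (z_{j-1}, z_j] under N(mu, sigma^2).

    cuts holds the finite boundaries z_1 < ... < z_k; z_0 = -inf, z_{k+1} = +inf.
    """
    edges = [-math.inf] + list(cuts) + [math.inf]
    cdf = [std_normal_cdf((z - mu) / sigma) for z in edges]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]

def pearson_statistic(counts, probs):
    """Sum over classes of (n_j - n P_j)^2 / (n P_j)."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))

# Hypothetical grouped data: k = 3 cut points, hence k + 1 = 4 classes.
cuts = [-1.0, 0.0, 1.0]
counts = [18, 30, 36, 16]
mu_hat, sigma_hat = 0.0, 1.0     # assumed estimates, supplied for illustration
probs = cell_probs(cuts, mu_hat, sigma_hat)
print(probs, pearson_statistic(counts, probs))
```

With $k = 3$ and two estimated parameters, the resulting value would be referred to a chi-square distribution with $k - 2 = 1$ degree of freedom, subject to the caveat in the example about how the estimates were obtained.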

5.3 Test of the Equality of Two Multinomial
Distributions and Generalizations

A problem that is of great practical importance is that of testing whether several
random samples can be considered as drawn from the same population. For
instance, in Subsec. 4.3 we tested whether several assumed normal populations
could be considered the same normal population. In this subsection we first
indicate a test of the hypothesis that two multinomial populations can be con-
sidered the same and then indicate some generalizations. Suppose that there are
$k + 1$ groups associated with each of the two multinomial populations. Let the
first population have associated probabilities $p_{11}, p_{12}, \ldots, p_{1k}, p_{1,k+1}$ and the
second $p_{21}, p_{22}, \ldots, p_{2k}, p_{2,k+1}$. It is desired to test $\mathcal{H}_0$: $p_{1j} = p_{2j}$ ($= p_j$, say),
$j = 1, \ldots, k+1$. For a sample of size $n_1$ from the first population, let $N_{1j}$
denote the number of outcomes in group $j$, $j = 1, \ldots, k+1$. Similarly, let $N_{2j}$
denote the number of outcomes in group $j$ of a sample of size $n_2$ from the second
population. (Here we are assuming that the sample sizes $n_1$ and $n_2$ are known.)
We know that

$$\sum_{j=1}^{k+1}\frac{(N_{ij} - n_i p_{ij})^2}{n_i p_{ij}}$$

has a limiting chi-square distribution with $k$ degrees of freedom for $i = 1$ and 2;
hence

$$\sum_{i=1}^{2}\sum_{j=1}^{k+1}\frac{(N_{ij} - n_i p_{ij})^2}{n_i p_{ij}}$$

has a limiting chi-square distribution with $2k$ degrees of freedom if the two
random samples are independent. If $\mathcal{H}_0$ is true, then

$$Q_{2k} = \sum_{i=1}^{2}\sum_{j=1}^{k+1}\frac{(N_{ij} - n_i p_j)^2}{n_i p_j} \tag{29}$$

has a limiting chi-square distribution with $2k$ degrees of freedom. If $\mathcal{H}_0$
specifies the values $p_j$, then $Q_{2k}$ is a statistic and can be used as a test statistic.
On the other hand, if the $p_j$ defined by $\mathcal{H}_0$ are unknown, then they have to be
estimated. If $\mathcal{H}_0$ is true, the two samples can be considered as one random
sample of size $n_1 + n_2$ from a multinomial population with probabilities $p_1, \ldots,
p_{k+1}$. Maximum-likelihood estimators of the $p_j$ are then $(N_{1j} + N_{2j})/(n_1 + n_2)$,
$j = 1, \ldots, k+1$, and if the $p_j$ in Eq. (29) are replaced by their maximum-likelihood
estimators, we then obtain

$$Q'_{2k} = \sum_{i=1}^{2}\sum_{j=1}^{k+1}\frac{[N_{ij} - n_i(N_{1j} + N_{2j})/(n_1 + n_2)]^2}{n_i(N_{1j} + N_{2j})/(n_1 + n_2)}. \tag{30}$$

It can be shown that $Q'_{2k}$ has a limiting chi-square distribution with $2k - k = k$
degrees of freedom. (This result is not a direct corollary of Theorem 9; it
would, however, be a corollary of a generalization of Theorem 9 from one to
two populations.) Again the degrees of freedom of the limiting distribution of
$Q'_{2k}$ have been reduced by unity for each parameter estimated.

Another test of the homogeneity of two multinomial populations can be
derived by finding the generalized likelihood-ratio $\Lambda$ and employing Theorem 7
to obtain the limiting distribution of $-2 \log \Lambda$. (Reparameterization is required
before Theorem 7 can be employed directly.) The details of finding such a test
are left as an exercise.


EXAMPLE 23 In an opinion survey regarding a certain political issue there
was some question as to whether or not the eligible voters under 25 years
of age might view the issue differently from those over 25. Fifteen hundred
individuals of those over 25 were interviewed, and 1000 of those under 25
were interviewed, with the following results (the data are obviously
artificial to facilitate calculations):

                 Under 25    Over 25    Total
    Opposed         400        600       1000
    Undecided       100        400        500
    Favor           500        500       1000
    Total          1000       1500       2500

Test the null hypothesis that there is no evidence of difference of opinion
due to the different age grouping; that is, test $\mathcal{H}_0$: $p_{1j} = p_{2j} = p_j$,
$j = 1, 2, 3$. Here $p_1$ and $p_2$ need to be estimated. We can calculate the value of the
statistic given in Eq. (30) as follows:

\[
\frac{(400 - 1000 \cdot 1000/2500)^2}{1000 \cdot 1000/2500}
+ \frac{(100 - 1000 \cdot 500/2500)^2}{1000 \cdot 500/2500}
+ \frac{(500 - 1000 \cdot 1000/2500)^2}{1000 \cdot 1000/2500}
\]
\[
+ \frac{(600 - 1500 \cdot 1000/2500)^2}{1500 \cdot 1000/2500}
+ \frac{(400 - 1500 \cdot 500/2500)^2}{1500 \cdot 500/2500}
+ \frac{(500 - 1500 \cdot 1000/2500)^2}{1500 \cdot 1000/2500} = 125.
\]

The 99 percent quantile point for the chi-square distribution with two
degrees of freedom is only 9.21, so there is strong evidence that the two age
groups have different opinions on the political issue. ////
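
The arithmetic of Example 23 can be checked with a short script. The following is only a sketch of the statistic in Eq. (30) for two samples; the function name `homogeneity_chi_square` and the variable names are illustrative and are not taken from the text.

```python
def homogeneity_chi_square(sample1, sample2):
    """Statistic of Eq. (30): the p_j are replaced by pooled estimates."""
    n1, n2 = sum(sample1), sum(sample2)
    q = 0.0
    for c1, c2 in zip(sample1, sample2):
        p_hat = (c1 + c2) / (n1 + n2)      # maximum-likelihood estimate of p_j
        for n_i, observed in ((n1, c1), (n2, c2)):
            expected = n_i * p_hat          # estimated expected count in the cell
            q += (observed - expected) ** 2 / expected
    return q


# Counts from Example 23, one entry per opinion category.
under_25 = [400, 100, 500]
over_25 = [600, 400, 500]

q = homogeneity_chi_square(under_25, over_25)
dof = len(under_25) - 1                     # 2k - k = k with k + 1 = 3 categories
print(q, dof)
```

The value of `q` is then compared with a quantile of the chi-square distribution with `dof` degrees of freedom, as in the example.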


The technique presented in this subsection can be generalized in two
directions. First, a test of the homogeneity of several, rather than just two,
multinomial populations can be obtained, and second, a test of the hypothesis that
several given samples are drawn from the same population of a specified type
(such as the Poisson, the gamma, etc.) can be obtained using a procedure similar
to that above. We illustrate with an example.


EXAMPLE 24 One hundred observations were drawn from each of two Poisson
populations with the following results:

                   0    1    2    3    4    5    6    7    8   9 or more   Total
    Population 1  11   25   28   20    9    3    3    0    1       0        100
    Population 2  13   27   28   17   11    1    2    1    0       0        100
    Total         24   52   56   37   20   11 (all values of 5 or more)     200

Is there strong evidence in the data to support the contention that the two
Poisson populations are different? That is, test the hypothesis that the
two populations are the same. This hypothesis can be tested in a variety
of ways. We first use the chi-square technique mentioned above. We
group the data into six groups, the last including all values greater than 4, as
indicated in the above table. If the two populations are the same, we have
to estimate one parameter, namely, the mean of the common Poisson
distribution. The maximum-likelihood estimate is the sample mean,
which is

\[
\frac{0(24) + 1(52) + 2(56) + 3(37) + 4(20) + 5(4) + 6(5) + 7(1) + 8(1)}{200} = 2.1.
\]

The expected number in each group of each population is given by

      0        1        2        3        4     5 or more
    12.25    25.72    27.00    18.90     9.92      6.21

The value of the statistic in Eq. (29), where $n_i p_j$ is replaced by the estimates
given in the above table, can be calculated. It is approximately 1.68.
The degrees of freedom should be $2k - 1$ (one parameter is estimated),
which is 9. The test indicates that there is no reason to suspect that the
two assumed Poisson populations are different Poisson populations. ////
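
The computations of Example 24 can be reproduced as follows. This is a minimal sketch under the assumptions stated in the example (a common Poisson mean estimated by the pooled sample mean, and the last group collecting all values of 5 or more); the helper names are hypothetical.

```python
import math

# Observed counts for the values 0, 1, ..., 8 and "9 or more" (Example 24).
pop1 = [11, 25, 28, 20, 9, 3, 3, 0, 1, 0]
pop2 = [13, 27, 28, 17, 11, 1, 2, 1, 0, 0]
n1, n2 = sum(pop1), sum(pop2)

# Pooled sample mean: the maximum-likelihood estimate of the common Poisson mean.
mean_hat = sum(v * (a + b) for v, (a, b) in enumerate(zip(pop1, pop2))) / (n1 + n2)


def poisson_group_probs(mean, last):
    """Probabilities of 0, 1, ..., last - 1 and of the group 'last or more'."""
    probs = [math.exp(-mean) * mean ** v / math.factorial(v) for v in range(last)]
    probs.append(1.0 - sum(probs))
    return probs


def group(counts, last):
    """Collapse all counts for values >= last into a single group."""
    return counts[:last] + [sum(counts[last:])]


probs = poisson_group_probs(mean_hat, 5)

stat = 0.0
for counts, n_i in ((pop1, n1), (pop2, n2)):
    for observed, p in zip(group(counts, 5), probs):
        expected = n_i * p
        stat += (observed - expected) ** 2 / expected

print(mean_hat, [round(100 * p, 2) for p in probs], round(stat, 2))
```

Because the expected counts are not rounded before use here, the statistic can differ from a hand calculation in the third decimal place.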
We mentioned earlier that there are several methods of testing the null
hypothesis considered here. For example, the generalized likelihood-ratio
principle and employment of Theorem 7 yield a test that the student may find
instructive to derive for himself.


5.4 Tests of Independence in Contingency Tables

A contingency table is a multiple classification; for example, in a public opinion
survey the individuals interviewed may be classified according to their attitude
on a political proposal and according to sex to obtain a table of counts with one
cell for each combination of sex and attitude. With two sexes and three attitude
categories this is a 2 × 3 contingency table. The individuals are classified by two
criteria, one having two categories and the other three categories. The six distinct
classifications are called cells. A three-way contingency table would have been
obtained had the individuals been further classified according to a third criterion,
say, according to an annual income group. If there were five income groups
set up (such as under $2000, $2000 to $4000, ...), the contingency table would be
called a 2 × 3 × 5 table and would have 30 cells into which a person might be
put. It is often quite convenient to think of the cells as cubes in a block two
units wide, three units long, and five units deep. If the individuals were still
further classified into eight geographic locations, one would have a four-way
(2 × 3 × 5 × 8) contingency table with 240 cells in a four-dimensional block
with edges two, three, five, and eight units long.

A contingency table provides a convenient display of the data for ultimately
investigating suspected relationships. Thus one may suspect that men and women
will react differently to a certain political proposal, in which case one would
construct such a table as the one above and test the null hypothesis that their
attitudes were independent of their sex. To consider another example, a geneticist
may suspect that susceptibility to a certain disease is heritable. He would classify
a sample of individuals according to (i) whether or not they ever had the disease,
(ii) whether or not their fathers had the disease, and (iii) whether or not their
mothers had the disease. In the resulting 2 × 2 × 2 contingency table he would
test the null hypothesis that classification (i) was independent of (ii) and (iii).
Again, a medical research worker might suspect a certain environmental condition
favored a given disease and classify individuals according to (i) whether or not
they ever had the disease and (ii) whether or not they were subject to the condition.
An industrial engineer could use a contingency table to discover whether or not two
kinds of defects in a manufactured product were due to the same underlying cause
or to different causes. It is apparent that the technique can be a very useful tool
in any field of research.

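Two-way contingency tables We shall suppose that $n$ individuals or items are
classified according to two criteria $A$ and $B$, that there are $r$ classifications
$A_1, A_2, \ldots, A_r$ in $A$ and $s$ classifications $B_1, B_2, \ldots, B_s$ in $B$, and that the number
of individuals belonging to $A_i$ and $B_j$ is $N_{ij}$. We have then an $r \times s$ contingency
table with cell frequencies $N_{ij}$ and $\sum_{i,j} N_{ij} = n$:

\[
\begin{array}{c|cccc}
 & B_1 & B_2 & \cdots & B_s \\ \hline
A_1 & N_{11} & N_{12} & \cdots & N_{1s} \\
A_2 & N_{21} & N_{22} & \cdots & N_{2s} \\
\vdots & \vdots & \vdots & & \vdots \\
A_r & N_{r1} & N_{r2} & \cdots & N_{rs}
\end{array} \tag{31}
\]

As a further notation we shall denote the row totals by $N_{i.}$ and the column totals
by $N_{.j}$; that is,

\[
N_{i.} = \sum_{j} N_{ij} \quad \text{and} \quad N_{.j} = \sum_{i} N_{ij}.
\]

Of course,

\[
\sum_{i} N_{i.} = \sum_{j} N_{.j} = n.
\]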

We shall now set up a probability model for the problem with which we
wish to deal. The $n$ individuals will be regarded as a sample of size $n$ from a
multinomial population with probabilities $p_{ij}$ ($i = 1, 2, \ldots, r$; $j = 1, 2, \ldots, s$).
The probability density function for a single observation is

\[
f(x_{11}, x_{12}, \ldots, x_{rs}; p_{11}, \ldots, p_{rs}) = \prod_{i,j} p_{ij}^{x_{ij}}, \tag{32}
\]

where $x_{ij} = 0$ or $1$ and $\sum_{i,j} x_{ij} = 1$.
We wish to test the null hypothesis that the $A$ and $B$ classifications are
independent, i.e., that the probability that an individual falls in $B_j$ is not affected
by the $A$ class to which the individual happens to belong. Using the symbolism of
Chap. I, we would write

\[
P[B_j \mid A_i] = P[B_j] \quad \text{and} \quad P[A_i \mid B_j] = P[A_i],
\]

or

\[
P[A_i \cap B_j] = P[A_i] P[B_j].
\]

If we denote the marginal probabilities $P[A_i]$ by $p_{i.}$ ($i = 1, 2, \ldots, r$) and the
marginal probabilities $P[B_j]$ by $p_{.j}$ ($j = 1, 2, \ldots, s$), the null hypothesis is simply

\[
\mathcal{H}_0: \; p_{ij} = p_{i.} p_{.j}, \qquad \sum_{i} p_{i.} = 1, \qquad \sum_{j} p_{.j} = 1. \tag{33}
\]

When the null hypothesis is not true, there is said to be interaction between the
two criteria of classification.

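The complete parameter space $\Theta$ for the distribution of $N_{11}, \ldots, N_{rs}$ has
$rs - 1$ dimensions (having specified all but one of the $p_{ij}$, the remaining one is
fixed by $\sum_{i,j} p_{ij} = 1$), while under $\mathcal{H}_0$ we have a parameter space $\Theta_0$ with
$r - 1 + s - 1$ dimensions. (The null hypothesis is specified by $p_{i.}$, $i = 1, \ldots, r$,
and $p_{.j}$, $j = 1, \ldots, s$, but there are only $r - 1 + s - 1$ dimensions because
$\sum_i p_{i.} = 1$ and $\sum_j p_{.j} = 1$.) The likelihood for a sample of size $n$ is

\[
L = \prod_{i,j} p_{ij}^{n_{ij}},
\]

and its maximum in $\Theta$ occurs when

\[
\hat{p}_{ij} = \frac{n_{ij}}{n}.
\]

In $\Theta_0$ the likelihood is

\[
L = \prod_{i,j} (p_{i.} p_{.j})^{n_{ij}}
  = \Big( \prod_{i} p_{i.}^{n_{i.}} \Big) \Big( \prod_{j} p_{.j}^{n_{.j}} \Big), \tag{34}
\]

and its maximum occurs at

\[
\hat{p}_{i.} = \frac{n_{i.}}{n} \quad \text{and} \quad \hat{p}_{.j} = \frac{n_{.j}}{n}. \tag{35}
\]

The generalized likelihood ratio is therefore

\[
\lambda = \frac{\big( \prod_{i} n_{i.}^{n_{i.}} \big) \big( \prod_{j} n_{.j}^{n_{.j}} \big)}
               {n^{n} \prod_{i,j} n_{ij}^{n_{ij}}}. \tag{36}
\]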
The distribution of $\Lambda$ under the null hypothesis is not unique because the
hypothesis is composite and the exact distribution of $\Lambda$ does involve the unknown
parameters $p_{i.}$ and $p_{.j}$; hence, it is very difficult to solve for $\lambda_0$ in
$\sup_{\Theta_0} P[\Lambda \le \lambda_0] = \alpha$. For large samples we do have a test, however, because
$-2 \log \Lambda$ is in that case approximately distributed as a chi-square random variable with

\[
rs - 1 - (r + s - 2) = (r - 1)(s - 1)
\]

degrees of freedom, and on the basis of this distribution a unique critical region
for $\lambda$ may be determined. The degrees of freedom $rs - 1 - (r + s - 2)$ is
obtained by subtracting $r + s - 2$, which is the dimension of $\Theta_0$, from $rs - 1$,
which is the dimension of $\Theta$. Also, $(r - 1)(s - 1)$ is the number of parameters
specified by $\mathcal{H}_0$. (See Theorem 7 and the comment following it.) Actually,
the null hypothesis $p_{ij} = p_{i.} p_{.j}$ is not of the form required by Theorem 7,
so it might be instructive to consider the necessary reparameterization.

For convenience, let us take $r = s = 2$. Now
$\Theta = \{(\theta_1, \theta_2, \theta_3) = (p_{11}, p_{12}, p_{21}): p_{11} \ge 0; \; p_{12} \ge 0; \; p_{21} \ge 0; \text{ and } p_{11} + p_{12} + p_{21} \le 1\}$.
Let $\Theta'$ with points $(\theta'_1, \theta'_2, \theta'_3)$ denote the reparameterized space, where
$\theta'_1 = p_{11} - p_{1.} p_{.1}$, $\theta'_2 = p_{1.}$, and $\theta'_3 = p_{.1}$. It can be easily demonstrated that
$\Theta'$ is a one-to-one transformation of $\Theta$. Also, the null hypothesis
$\mathcal{H}_0$: $p_{11} = p_{1.} p_{.1}$, $p_{12} = p_{1.}(1 - p_{.1})$, and $p_{21} = (1 - p_{1.}) p_{.1}$ in the original
parameter space $\Theta$ becomes $\mathcal{H}'_0$: $\theta'_1 = 0$ and $\theta'_2$ and $\theta'_3$ unspecified in the
reparameterized space $\Theta'$. [Note that $p_{12} = p_{1.}(1 - p_{.1})$ is equivalent to
$p_{1.} - p_{11} = p_{1.}(1 - p_{.1})$, which is equivalent to $p_{11} - p_{1.} p_{.1} = 0$. Similarly for
$p_{21} = (1 - p_{1.}) p_{.1}$.] $\mathcal{H}'_0$ is of the form required by Theorem 7.

In general, a point in the $(rs - 1)$-dimensional parameter space $\Theta$
can be conveniently displayed as

\[
\begin{array}{ccccc}
p_{11} & p_{12} & \cdots & p_{1,s-1} & p_{1s} \\
p_{21} & p_{22} & \cdots & p_{2,s-1} & p_{2s} \\
\vdots & \vdots & & \vdots & \vdots \\
p_{r-1,1} & p_{r-1,2} & \cdots & p_{r-1,s-1} & p_{r-1,s} \\
p_{r1} & p_{r2} & \cdots & p_{r,s-1} &
\end{array}
\]

and a point in the reparameterized space $\Theta'$ can be displayed as

\[
\begin{array}{ccccc}
p_{11} - p_{1.} p_{.1} & p_{12} - p_{1.} p_{.2} & \cdots & p_{1,s-1} - p_{1.} p_{.,s-1} & p_{1.} \\
p_{21} - p_{2.} p_{.1} & p_{22} - p_{2.} p_{.2} & \cdots & p_{2,s-1} - p_{2.} p_{.,s-1} & p_{2.} \\
\vdots & \vdots & & \vdots & \vdots \\
p_{r-1,1} - p_{r-1,.} p_{.1} & p_{r-1,2} - p_{r-1,.} p_{.2} & \cdots & p_{r-1,s-1} - p_{r-1,.} p_{.,s-1} & p_{r-1,.} \\
p_{.1} & p_{.2} & \cdots & p_{.,s-1} &
\end{array}
\]
In casting about for a test which may be used when the sample is not large,
we may inquire how it is that a test criterion comes to have a unique distribution
for large samples when the distribution actually depends on unknown parameters
which may have any values in certain ranges. The answer is that the
parameters are not really unknown; they can be estimated, and their estimates
approach their true values as the sample size increases. In the limit as $n$ becomes
infinite, the parameters are known exactly, and it is at that point that the
distribution of $\Lambda$ actually becomes unique. It is unique because a particular point
in $\Theta_0$ is selected as the true parameter point, so that the $N_{ij}$ are given a unique
distribution and the distribution of $\Lambda$ is then determined by this distribution.

It would appear reasonable to employ a similar procedure to set up a test
for small samples, i.e., to define a distribution for $\Lambda$ by using the estimates for
the unknown parameters. In the present problem, since the estimates of the
$p_{i.}$ and $p_{.j}$ are given by Eq. (35), we might just substitute those values in the
distribution function of the $N_{ij}$ and use the distribution to obtain a distribution
for $\Lambda$. However, we should still be in trouble; the critical region would depend
on the marginal totals $N_{i.}$ and $N_{.j}$; hence the probability of a Type I error would
vary from sample to sample for any fixed critical region $0 < \lambda < \lambda_0$.

There is a way out of this difficulty, which is well worth investigation
because of its own interest and because the problem is important in applied
statistics. Let us denote the joint density of all the $N_{ij}$ briefly by $f(n_{ij})$, the
marginal density of all the $N_{i.}$ and $N_{.j}$ by $g(n_{i.}, n_{.j})$, and the conditional density
of the $N_{ij}$, given the marginal totals, by

\[
f(n_{ij} \mid n_{i.}, n_{.j}) = \frac{f(n_{ij})}{g(n_{i.}, n_{.j})}.
\]

Under the null hypothesis this conditional distribution happens to be
independent of the unknown parameters (as we shall show presently); the estimators
$N_{i.}/n$ and $N_{.j}/n$ form a sufficient set of statistics for the $p_{i.}$ and $p_{.j}$. This fact
will enable us to construct a test.

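The joint density of the $N_{ij}$ is simply the multinomial distribution

\[
f(n_{ij}) = f(n_{11}, n_{12}, \ldots, n_{rs}) = \frac{n!}{\prod_{i,j} n_{ij}!} \prod_{i,j} p_{ij}^{n_{ij}}, \tag{37}
\]

and in $\Theta_0$ (we are interested in the distribution of $\Lambda$ under $\mathcal{H}_0$) this becomes

\[
f(n_{ij}) = \frac{n!}{\prod_{i,j} n_{ij}!}
\Big( \prod_{i} p_{i.}^{n_{i.}} \Big) \Big( \prod_{j} p_{.j}^{n_{.j}} \Big). \tag{38}
\]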


To obtain the desired conditional distribution, we must first find the distribution
of the $N_{i.}$ and $N_{.j}$, and this is accomplished by summing Eq. (38) over all sets
of $n_{ij}$ such that

\[
\sum_{i} n_{ij} = n_{.j} \quad \text{and} \quad \sum_{j} n_{ij} = n_{i.}. \tag{39}
\]

For fixed marginal totals, only the factor $1/\prod_{i,j} n_{ij}!$ in Eq. (38) is involved in
the sum, so we have, in effect, to sum that factor over all $n_{ij}$ subject to Eq. (39).
The desired sum is given by comparing the coefficients of $\prod_i x_i^{n_{i.}}$ in the expression

\[
(x_1 + \cdots + x_r)^{n_{.1}} (x_1 + \cdots + x_r)^{n_{.2}} \cdots (x_1 + \cdots + x_r)^{n_{.s}}
= (x_1 + \cdots + x_r)^{n}. \tag{40}
\]

On the right-hand side the coefficient of $\prod_i x_i^{n_{i.}}$ is simply

\[
\frac{n!}{\prod_{i} n_{i.}!}. \tag{41}
\]

On the left-hand side there are terms with coefficients of the form

\[
\frac{n_{.1}!}{\prod_{i} n_{i1}!} \cdot \frac{n_{.2}!}{\prod_{i} n_{i2}!} \cdots \frac{n_{.s}!}{\prod_{i} n_{is}!}
= \frac{\prod_{j} n_{.j}!}{\prod_{i,j} n_{ij}!}, \tag{42}
\]

where $n_{ij}$ is the exponent of $x_i$ in the $j$th multinomial. In this expression the
$n_{ij}$ satisfy the conditions of Eq. (39); the first condition is satisfied in view of the
multinomial theorem, while the second is satisfied because we require the exponent
of $x_i$ in these terms to be $n_{i.}$. The sum of all such coefficients, Eq. (42),
must equal Eq. (41); hence, we may write

\[
\sum \frac{1}{\prod_{i,j} n_{ij}!} = \frac{n!}{\big( \prod_{i} n_{i.}! \big) \big( \prod_{j} n_{.j}! \big)}. \tag{43}
\]

This is precisely the sum that we require because there is obviously one and only
one coefficient of the form of Eq. (42) on the left of Eq. (40) for every possible
contingency table, Eq. (31), with given marginal totals. The distribution of the
$N_{i.}$ and $N_{.j}$ is, therefore,

\[
g(n_{1.}, \ldots, n_{r.}, n_{.1}, \ldots, n_{.s})
= \Big( \frac{n!}{\prod_{i} n_{i.}!} \prod_{i} p_{i.}^{n_{i.}} \Big)
  \Big( \frac{n!}{\prod_{j} n_{.j}!} \prod_{j} p_{.j}^{n_{.j}} \Big), \tag{44}
\]

which shows, incidentally, that the $N_{i.}$ are distributed independently of the $N_{.j}$
under $\mathcal{H}_0$; this is unexpected because $N_{1.}$ and $N_{.1}$, for example, have the
random variable $N_{11}$ in common!

The conditional distribution of the $N_{ij}$, given the marginal totals, is
obtained by dividing Eq. (38) by Eq. (44) to obtain

\[
f(n_{11}, n_{12}, \ldots, n_{rs} \mid n_{1.}, n_{2.}, \ldots, n_{.s})
= \frac{\big( \prod_{i} n_{i.}! \big) \big( \prod_{j} n_{.j}! \big)}{n! \prod_{i,j} n_{ij}!}, \tag{45}
\]

which, happily, does not involve the unknown parameters and shows that the
estimators are sufficient.

To see how a test may be constructed, let us consider the general situation
in which a test statistic $\Lambda$ for some test has a distribution $f_{\Lambda}(\lambda; \theta)$ which involves
an unknown parameter $\theta$. If $\theta$ has a sufficient statistic, say $T$, then the joint
density of $\Lambda$ and $T$ may be written

\[
f_{\Lambda, T}(\lambda, t; \theta) = f_{\Lambda \mid T}(\lambda \mid t) \, f_{T}(t; \theta),
\]

and the conditional density of $\Lambda$, given $T$, will not involve $\theta$. Using the
conditional distribution, we may find a number, say $\lambda_0(t)$, for every $t$ such that

\[
\int_{0}^{\lambda_0(t)} f_{\Lambda \mid T}(\lambda \mid t) \, d\lambda = .05, \tag{46}
\]

for example. In the $\lambda t$ plane the curve $\lambda = \lambda_0(t)$ together with the line $\lambda = 0$
will determine a region $R$. See Fig. 7. The probability that a sample will give
rise to a pair of values $(\lambda, t)$ which correspond to a point in $R$ is exactly .05
because

\[
P[(\Lambda, T) \in R]
= \int_{-\infty}^{\infty} \int_{0}^{\lambda_0(t)} f_{\Lambda, T}(\lambda, t; \theta) \, d\lambda \, dt
= \int_{-\infty}^{\infty} \Big[ \int_{0}^{\lambda_0(t)} f_{\Lambda \mid T}(\lambda \mid t) \, d\lambda \Big] f_{T}(t; \theta) \, dt
= \int_{-\infty}^{\infty} .05 \, f_{T}(t; \theta) \, dt
= .05.
\]

Hence we may test the hypothesis by using $T$ in conjunction with $\Lambda$. The
critical region is a plane region instead of an interval $0 < \lambda < \lambda_0$; it is such a
region that, whatever the unknown value of $\theta$ may be, the Type I error has a
specified probability. The test in any given situation actually amounts to a
conditional test; we observe $T$ and then perform the test by using the interval
$0 < \lambda < \lambda_0(t)$ determined from the conditional distribution of $\Lambda$, given $T$. It is to be
observed that this device cannot be employed unless there is a sufficient statistic for $\theta$.

The above technique is obviously applicable when $\theta$ is a set of parameters
rather than a single parameter and has a set of sufficient statistics. In particular,
the technique may be employed to test the null hypothesis of a two-way
contingency table using Eq. (36) to define $\lambda$. One merely uses the conditional
distribution of Eq. (45) and determines an interval $0 < \lambda < \lambda_0(n_{i.}, n_{.j})$ which has
the desired probability of a Type I error for the observed marginal totals.
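
For a very small table the conditional test can be carried out by direct enumeration. The sketch below is restricted to 2 × 2 tables, for which each table with given marginal totals is determined by its upper-left cell. It evaluates the conditional probability of Eq. (45) and $\log \lambda$ from Eq. (36) for every such table, and adds up the conditional probabilities of the tables whose $\lambda$ is no larger than the observed one. The observed table and all function names are hypothetical, chosen only for illustration.

```python
from math import factorial, log


def margins(table):
    """Row totals, column totals, and grand total of a two-way table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return row_tot, col_tot, sum(row_tot)


def conditional_prob(table):
    """Eq. (45): probability of the table given its marginal totals."""
    row_tot, col_tot, n = margins(table)
    num = 1
    for t in row_tot + col_tot:
        num *= factorial(t)
    den = factorial(n)
    for row in table:
        for cell in row:
            den *= factorial(cell)
    return num / den


def log_lambda(table):
    """Logarithm of Eq. (36), with the convention 0 log 0 = 0."""
    def xlogx(x):
        return x * log(x) if x > 0 else 0.0
    row_tot, col_tot, n = margins(table)
    return (sum(xlogx(t) for t in row_tot) + sum(xlogx(t) for t in col_tot)
            - xlogx(n) - sum(xlogx(c) for row in table for c in row))


def tables_2x2(row_tot, col_tot):
    """All 2 x 2 tables of nonnegative counts with the given marginal totals."""
    n = sum(row_tot)
    lo = max(0, row_tot[0] + col_tot[0] - n)
    hi = min(row_tot[0], col_tot[0])
    for a in range(lo, hi + 1):
        yield [[a, row_tot[0] - a],
               [col_tot[0] - a, row_tot[1] - col_tot[0] + a]]


observed = [[3, 1], [1, 3]]                 # hypothetical data
row_tot, col_tot, _ = margins(observed)
obs = log_lambda(observed)

all_tables = list(tables_2x2(row_tot, col_tot))
total_prob = sum(conditional_prob(t) for t in all_tables)
# Conditional probability of a lambda at least as small as the observed one;
# the small tolerance guards against rounding when two tables tie.
level = sum(conditional_prob(t) for t in all_tables
            if log_lambda(t) <= obs + 1e-12)
print(total_prob, level)
```

The hypothesis would be rejected at size $\alpha$ when `level` does not exceed $\alpha$. Because the conditional distribution is discrete, an exact size such as .05 is generally not attainable without randomization.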

In applications of this test one is confronted with a very tedious computa- 
tion in determining the distribution of A unless r, s, and the marginal totals are 
quite small. It can be shown, however, that the large-sample approximation 
may be used without appreciable error except when both r and s equal 2. In 
the latter instance, other simplifying approximations have been developed (see, 
for example, Fisher and Yates, “Tables for Statisticians and Biometricians,” 
Oliver & Boyd Ltd., Edinburgh or London, 1938), but we shall not explore the 
problem that far. 

Another test of the $\mathcal{H}_0$ given in Eq. (33) is obtained if the distribution in
Eq. (45) is replaced by its multivariate normal approximation, since then it can
be shown that the statistic

\[
Q = \sum_{i=1}^{r} \sum_{j=1}^{s}
\frac{[N_{ij} - n (N_{i.}/n)(N_{.j}/n)]^2}{n (N_{i.}/n)(N_{.j}/n)} \tag{47}
\]

has approximately the chi-square distribution with $rs - 1 - (r - 1 + s - 1) =
(r - 1)(s - 1)$ degrees of freedom. The test criterion is to reject $\mathcal{H}_0$ for large $Q$.
This is the criterion first proposed (by Karl Pearson) for testing the hypothesis,
and it differs from $-2 \log \lambda$ by terms of order $1/\sqrt{n}$. The two criteria are
therefore essentially equivalent unless $n$ is small. The argument that $Q$ is a
reasonable test statistic is entirely analogous to that used in Subsec. 5.2 above to
justify Eq. (25). The statistic $Q$ of Eq. (47) has intuitive appeal. $N_{ij}$ is the
observed number in the $ij$th cell, and $n (N_{i.}/n)(N_{.j}/n)$ is an estimator of the expected
number in the $ij$th cell when $\mathcal{H}_0$ is true. Thus, $Q$ will tend to be small for
$\mathcal{H}_0$ true and large for $\mathcal{H}_0$ false.
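
Both criteria are easy to compute for a given table. The following sketch evaluates $Q$ of Eq. (47) and $-2 \log \lambda$ from Eq. (36) for an $r \times s$ table of counts. For want of other data it reuses the counts of Example 23, arranged with the two age groups as rows; this reuse is an assumption made purely for illustration, and the function name is hypothetical.

```python
import math


def independence_statistics(table):
    """Return (Q, -2 log lambda, degrees of freedom) for a two-way table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    q = 0.0
    minus_two_log_lambda = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # n (N_i./n)(N_.j/n): estimated expected count under independence
            expected = row_tot[i] * col_tot[j] / n
            q += (observed - expected) ** 2 / expected
            if observed > 0:                # 0 log 0 is taken to be 0
                minus_two_log_lambda += 2.0 * observed * math.log(observed / expected)
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return q, minus_two_log_lambda, dof


counts = [[400, 100, 500],                  # under 25
          [600, 400, 500]]                  # over 25
q, g, dof = independence_statistics(counts)
print(q, g, dof)
```

The expression used for $-2 \log \lambda$ follows from Eq. (36) by taking logarithms: $-2 \log \lambda = 2 \sum_{i,j} n_{ij} \log [n_{ij} n / (n_{i.} n_{.j})]$.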

Three-way contingency tables If the elements of a population can be
classified according to three criteria $A$, $B$, and $C$ with classifications $A_i$ ($i = 1, 2, \ldots, s_1$),
$B_j$ ($j = 1, 2, \ldots, s_2$), and $C_k$ ($k = 1, 2, \ldots, s_3$), a sample of $n$ individuals may
be classified in a three-way $s_1 \times s_2 \times s_3$ contingency table. We shall let $p_{ijk}$
represent the probabilities associated with the individual cells and $N_{ijk}$ be the
numbers of sample elements in the individual cells, and as before marginal
totals will be indicated by replacing the summed index by a dot; thus

\[
N_{i.k} = \sum_{j=1}^{s_2} N_{ijk} \quad \text{and} \quad
N_{.j.} = \sum_{i=1}^{s_1} \sum_{k=1}^{s_3} N_{ijk}. \tag{48}
\]

There are four hypotheses that may be tested in connection with this table.
We may test whether all three criteria are mutually independent, in which case
the null hypothesis is

\[
\mathcal{H}_0: \; p_{ijk} = p_{i..} \, p_{.j.} \, p_{..k}, \tag{49}
\]

where $p_{i..} = \sum_j \sum_k p_{ijk}$, $p_{.j.} = \sum_i \sum_k p_{ijk}$, and $p_{..k} = \sum_i \sum_j p_{ijk}$; or we may test
whether any one of the three criteria is independent of the other two. Thus to
test whether the $B$ classification is independent of $A$ and $C$, we set up the null
hypothesis

\[
\mathcal{H}_0: \; p_{ijk} = p_{i.k} \, p_{.j.}, \tag{50}
\]

where $p_{i.k} = \sum_j p_{ijk}$.

The procedure for testing these hypotheses is entirely analogous to that
for the two-way tables. The likelihood of the sample is

\[
L = \prod_{i,j,k} p_{ijk}^{n_{ijk}}, \tag{51}
\]

where

\[
\sum_{i,j,k} p_{ijk} = 1 \quad \text{and} \quad \sum_{i,j,k} n_{ijk} = n.
\]

In $\Theta$ the maximum of $L$ occurs when

\[
\hat{p}_{ijk} = \frac{n_{ijk}}{n},
\]

so that

\[
\sup_{\Theta} L = \frac{1}{n^{n}} \prod_{i,j,k} n_{ijk}^{n_{ijk}}. \tag{52}
\]

To test the null hypothesis in Eq. (50), for example, we make the substitution
of Eq. (50) into Eq. (51) and maximize $L$ with respect to the $p_{i.k}$ and $p_{.j.}$ to find

\[
\hat{p}_{i.k} = \frac{n_{i.k}}{n} \quad \text{and} \quad \hat{p}_{.j.} = \frac{n_{.j.}}{n}
\]

and

\[
\sup_{\Theta_0} L = \frac{1}{n^{2n}}
\Big( \prod_{i,k} n_{i.k}^{n_{i.k}} \Big) \Big( \prod_{j} n_{.j.}^{n_{.j.}} \Big). \tag{53}
\]
The generalized likelihood ratio $\lambda$ is given by the quotient of Eqs. (52) and (53),
and in large samples $-2 \log \Lambda$ has approximately the chi-square distribution with

\[
s_1 s_2 s_3 - 1 - [(s_1 s_3 - 1) + s_2 - 1] = (s_1 s_3 - 1)(s_2 - 1)
\]

degrees of freedom. Again the large-sample distribution is quite adequate for
many purposes. $(s_1 s_3 - 1) + (s_2 - 1)$ is the dimension of $\Theta_0$, and $s_1 s_2 s_3 - 1$ is
the dimension of $\Theta$.

A test statistic analogous to that given in Eq. (47) for testing independence
in a two-way contingency table can also be derived. For testing that the $A$ and $C$
classifications are independent of the $B$ classification, such a test statistic is

\[
Q = \sum_{i} \sum_{j} \sum_{k}
\frac{[N_{ijk} - n (N_{i.k}/n)(N_{.j.}/n)]^2}{n (N_{i.k}/n)(N_{.j.}/n)}. \tag{54}
\]

Under $\mathcal{H}_0$, $Q$ has an asymptotic chi-square distribution with
$s_1 s_2 s_3 - 1 - (s_1 s_3 - 1) - (s_2 - 1) = (s_1 s_3 - 1)(s_2 - 1)$ degrees of freedom. Again, the
statistic $Q$ of Eq. (54) has intuitive appeal since $N_{ijk}$ is the observed number in
cell $ijk$ and $n (N_{i.k}/n)(N_{.j.}/n)$ is an estimator of the expected number when $\mathcal{H}_0$
is true.
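
As a small illustration of Eq. (54), the sketch below computes $Q$ and its degrees of freedom for a three-way table stored as nested lists, with `table[i][j][k]` playing the role of $N_{ijk}$. The counts are hypothetical: the first table is built so that $N_{ijk} = a_{ik} b_j$ exactly, in which case every observed count equals its estimated expectation and $Q$ vanishes.

```python
def three_way_q(table):
    """Q of Eq. (54) for testing that B is independent of (A, C)."""
    s1, s2, s3 = len(table), len(table[0]), len(table[0][0])
    n = sum(table[i][j][k]
            for i in range(s1) for j in range(s2) for k in range(s3))
    # Marginal totals N_{i.k} (summed over j) and N_{.j.} (summed over i and k).
    n_ik = [[sum(table[i][j][k] for j in range(s2)) for k in range(s3)]
            for i in range(s1)]
    n_j = [sum(table[i][j][k] for i in range(s1) for k in range(s3))
           for j in range(s2)]
    q = 0.0
    for i in range(s1):
        for j in range(s2):
            for k in range(s3):
                expected = n_ik[i][k] * n_j[j] / n
                q += (table[i][j][k] - expected) ** 2 / expected
    dof = (s1 * s3 - 1) * (s2 - 1)
    return q, dof


# Hypothetical 2 x 2 x 2 table that factors exactly as a[i][k] * b[j].
a = [[2, 4], [6, 8]]
b = [1, 3]
factored = [[[a[i][k] * b[j] for k in range(2)] for j in range(2)]
            for i in range(2)]
q_factored, dof = three_way_q(factored)

# Moving a few counts between cells breaks the factorization, so Q > 0.
perturbed = [[[2, 4], [6, 12]], [[6, 8], [18, 24]]]
perturbed[0][0][0] += 5
q_perturbed, _ = three_way_q(perturbed)
print(q_factored, q_perturbed, dof)
```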

6 TESTS OF HYPOTHESES AND 
CONFIDENCE INTERVALS 

In Subsec. 3.4 above we noted that a confidence interval for a unidimensional
parameter $\theta$ could be used to obtain a test of $\mathcal{H}_0$: $\theta = \theta_0$ versus $\mathcal{H}_1$: $\theta \ne \theta_0$.
In this section we will further explore that concept and show that one can
reverse the operation; that is, one can use a family of tests of $\mathcal{H}_0$: $\theta = \theta_0$ versus
$\mathcal{H}_1$: $\theta \ne \theta_0$ (the family is generated by varying $\theta_0$) to obtain a confidence
interval for $\theta$. Our considerations in this section will not be very thorough; our
intent is merely to present an introduction to the usefulness of the close
relationship between hypothesis testing and confidence intervals.

Our discussion can be made somewhat more general if we speak in terms
of confidence sets rather than confidence intervals. As usual, let $\mathfrak{X}$ denote the
sample space, $\Theta$ the parameter space, and $(x_1, \ldots, x_n)$ the observed sample.

Definition 18 Confidence set A family of subsets of the parameter
space $\Theta$ indexed by $(x_1, \ldots, x_n) \in \mathfrak{X}$, denoted by
$\mathcal{S} = \{S(x_1, \ldots, x_n): S(x_1, \ldots, x_n) \subset \Theta; \; (x_1, \ldots, x_n) \in \mathfrak{X}\}$,
is defined to be a family of confidence sets with confidence coefficient $\gamma$ if and only if

\[
P_{\theta}[S(X_1, \ldots, X_n) \text{ contains } \theta] = \gamma \quad \text{for all } \theta \in \Theta. \tag{55}
\]
It should be emphasized that any member, say $S(x_1, \ldots, x_n)$, of the family
of confidence sets is a subset of $\Theta$, the parameter space. $S(X_1, \ldots, X_n)$ is a
random subset; for any possible value, say $(x_1, \ldots, x_n)$, of $(X_1, \ldots, X_n)$,
$S(X_1, \ldots, X_n)$ takes on the value $S(x_1, \ldots, x_n)$, a member of the family $\mathcal{S}$. To
aid in the interpretation of the probability statement in Eq. (55), note that for a
fixed (yet arbitrary) $\theta$, "$S(X_1, \ldots, X_n)$ contains $\theta$" is an event [it is the event that
the random interval $S(X_1, \ldots, X_n)$ contains the fixed $\theta$], and the $\theta$ that appears
as a subscript in $P_{\theta}$ is the $\theta$ that indexes the distribution of the $X_i$'s appearing
in $S(X_1, \ldots, X_n)$.

For instance, suppose $X_1, \ldots, X_n$ is a random sample from $N(\theta, 1)$,
$-\infty < \theta < \infty$. Let the subset $S(x_1, \ldots, x_n)$ be the interval
$(\bar{x} - z/\sqrt{n}, \; \bar{x} + z/\sqrt{n})$, where $z$ is given by $\Phi(z) - \Phi(-z) = \gamma$; then the family of
subsets $\mathcal{S} = \{S(x_1, \ldots, x_n): S(x_1, \ldots, x_n) = (\bar{x} - z/\sqrt{n}, \; \bar{x} + z/\sqrt{n})\}$ is a family
of confidence sets with a confidence coefficient $\gamma$ since

\[
P_{\theta}[S(X_1, \ldots, X_n) \text{ contains } \theta]
= P_{\theta}\Big[ \bar{X} - \frac{z}{\sqrt{n}} < \theta < \bar{X} + \frac{z}{\sqrt{n}} \Big]
= P_{\theta}\Big[ -z < \frac{\bar{X} - \theta}{1/\sqrt{n}} < z \Big] = \gamma
\quad \text{for all } \theta \in \Theta.
\]

The family $\mathcal{S}$ is a family of confidence intervals for $\theta$ having a confidence
coefficient $\gamma$. In general, then, a confidence interval is an example of a
confidence set.

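Confidence sets can be constructed from tests of hypotheses, as we now
show. Let $\Upsilon_{\theta_0}$ be a size-$\alpha$ test (nonrandomized) of the null hypothesis
$\mathcal{H}_0$: $\theta = \theta_0$, and let $\mathfrak{X}(\theta_0)$ be the acceptance region of the test $\Upsilon_{\theta_0}$. [The acceptance
region is the set complement of the critical region; that is, if the critical region is
given by $C(\theta_0)$, then $\mathfrak{X}(\theta_0) = \mathfrak{X} - C(\theta_0)$.] Note that $\mathfrak{X}(\theta_0)$ is a subset of $\mathfrak{X}$
indexed by $\theta_0$. Since the test $\Upsilon_{\theta_0}$ has size $\alpha$,

\[
P_{\theta_0}[(X_1, \ldots, X_n) \in \mathfrak{X}(\theta_0)] = 1 - \alpha.
\]

If we now vary $\theta_0$ over $\Theta$ and for each $\theta_0$ we have a test $\Upsilon_{\theta_0}$, then we get a
family of acceptance regions, namely, $\{\mathfrak{X}(\theta_0): \theta_0 \in \Theta\}$. $\mathfrak{X}(\theta_0)$ is the acceptance
region of test $\Upsilon_{\theta_0}$. One can now define

\[
S(x_1, \ldots, x_n) = \{\theta_0: (x_1, \ldots, x_n) \in \mathfrak{X}(\theta_0)\}. \tag{56}
\]

Clearly $S(x_1, \ldots, x_n)$ is a subset of $\Theta$. Furthermore, the family $\{S(x_1, \ldots, x_n)\}$
is a family of confidence sets with a confidence coefficient $\gamma = 1 - \alpha$ since one has
$\{S(X_1, \ldots, X_n) \text{ contains } \theta_0\}$ if and only if one has $\{(X_1, \ldots, X_n) \in \mathfrak{X}(\theta_0)\}$, and so

\[
P_{\theta_0}[S(X_1, \ldots, X_n) \text{ contains } \theta_0]
= P_{\theta_0}[(X_1, \ldots, X_n) \in \mathfrak{X}(\theta_0)] = 1 - \alpha.
\]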
EXAMPLE 25 Let $X_1, \ldots, X_n$ be a random sample from $N(\theta, 1)$, and consider
testing $\mathcal{H}_0$: $\theta = \theta_0$. A test with size $\alpha$ is given by the following: Reject $\mathcal{H}_0$
if and only if $|\bar{x} - \theta_0| \ge z/\sqrt{n}$, where $z$ is defined by $\Phi(z) - \Phi(-z) =
1 - \alpha$. The acceptance region of this test is given by

\[
\mathfrak{X}(\theta_0) = \Big\{ (x_1, \ldots, x_n): \theta_0 - \frac{z}{\sqrt{n}} < \bar{x} < \theta_0 + \frac{z}{\sqrt{n}} \Big\}.
\]

We can now define, as in Eq. (56),

\[
S(x_1, \ldots, x_n) = \{\theta_0: (x_1, \ldots, x_n) \in \mathfrak{X}(\theta_0)\}
= \Big\{ \theta_0: \theta_0 - \frac{z}{\sqrt{n}} < \bar{x} < \theta_0 + \frac{z}{\sqrt{n}} \Big\}
= \Big\{ \theta_0: \bar{x} - \frac{z}{\sqrt{n}} < \theta_0 < \bar{x} + \frac{z}{\sqrt{n}} \Big\}.
\]

$S(x_1, \ldots, x_n)$ is a confidence set (in fact a confidence interval) with a
confidence coefficient $\gamma = 1 - \alpha$. ////
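
The inversion in Example 25 can be imitated numerically: run the size-$\alpha$ test for each $\theta_0$ on a fine grid and keep the values that are accepted. The sketch below does this and compares the retained grid points with the interval $(\bar{x} - z/\sqrt{n}, \; \bar{x} + z/\sqrt{n})$. The sample values and the grid are hypothetical, and the observations are assumed to have unit variance as in the example.

```python
from math import sqrt
from statistics import NormalDist, fmean


def accepts(sample, theta0, z):
    """Acceptance region of Example 25: |xbar - theta0| < z / sqrt(n)."""
    return abs(fmean(sample) - theta0) < z / sqrt(len(sample))


def confidence_set_on_grid(sample, grid, alpha):
    """Eq. (56) restricted to a grid: all theta0 whose test accepts the sample."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # Phi(z) - Phi(-z) = 1 - alpha
    return [theta0 for theta0 in grid if accepts(sample, theta0, z)]


sample = [0.3, -1.2, 0.8, 1.9, 0.4, -0.1, 1.1, 0.6, -0.5]   # hypothetical data
alpha = 0.05
step = 0.001
grid = [i * step for i in range(-2000, 3001)]

kept = confidence_set_on_grid(sample, grid, alpha)

# Closed-form interval of Example 25 for comparison.
z = NormalDist().inv_cdf(1 - alpha / 2)
lower = fmean(sample) - z / sqrt(len(sample))
upper = fmean(sample) + z / sqrt(len(sample))
print(min(kept), max(kept), lower, upper)
```

The smallest and largest accepted grid points agree with the endpoints of the closed-form interval up to the grid spacing.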


The general procedure exhibited above shows how tests of hypotheses can
be used to generate or construct confidence sets. The procedure is reversible;
that is, a given family of confidence sets can be "reverted" to give a test of
hypothesis. Specifically, for a given family $\{S(x_1, \ldots, x_n)\}$ of confidence sets
with a confidence coefficient $\gamma$, if we define

\[
\mathfrak{X}(\theta_0) = \{(x_1, \ldots, x_n): \theta_0 \in S(x_1, \ldots, x_n)\}, \tag{57}
\]

then the nonrandomized test with acceptance region $\mathfrak{X}(\theta_0)$ is a test of $\mathcal{H}_0$: $\theta = \theta_0$
with size $\alpha = 1 - \gamma$.

The usefulness of the strong relationship between tests of hypotheses and 
confidence sets is exemplified not only in the fact that one can be used to construct 
the other but also in the result that often an optimal property of one carries over 
to the other. That is, if one can find a test that is optimal in some sense, then 
the corresponding constructed confidence set is also optimal in some sense, and 
conversely. We will not study the very interesting theoretical result alluded to 
in the previous sentence, but we will give the following in order to give some idea 
of the types of optimality that can be expected. (See the more advanced books 
of Ref. 16 and Ref. 19 for a detailed discussion.) An optimum property of 
confidence sets is given in the following definition. 
Definition 19 Uniformly most accurate A family $\{S^*(x_1, \ldots, x_n)\}$ of
confidence sets with a confidence coefficient $\gamma$ is defined to be a uniformly
most accurate family of confidence sets at a confidence coefficient $\gamma$ if for
any other family $\{S(x_1, \ldots, x_n)\}$ of confidence sets with a coefficient $\gamma$

\[
P_{\theta'}[S^*(X_1, \ldots, X_n) \text{ contains } \theta] \le P_{\theta'}[S(X_1, \ldots, X_n) \text{ contains } \theta]
\]

for all $\theta$ and $\theta'$ ($\theta \ne \theta'$). ////

Definition 19 is saying that $S^*(X_1, \ldots, X_n)$ is less likely to contain an
incorrect $\theta$ than is $S(X_1, \ldots, X_n)$, whereas both $S^*(X_1, \ldots, X_n)$ and $S(X_1, \ldots, X_n)$
have the same probability of containing the correct $\theta$. As you may have guessed,
uniformly most accurate confidence sets rarely exist. However, uniformly most
accurate confidence sets within restricted classes of confidence sets could also be
defined, and then one could be hopeful of the existence of such optimal confidence
sets. A general type of result that derives from the close relationship between
tests of hypotheses and confidence sets is the following: If $\Upsilon^*$ is a uniformly most
powerful size-$\alpha$ test of $\mathcal{H}_0$: $\theta = \theta_0$ within some restricted class of tests, then the
confidence set corresponding to $\Upsilon^*$ is uniformly most accurate with coefficient
$\gamma = 1 - \alpha$ within some restricted class of confidence sets. With such a result
one can see how an optimality of a test can be transferred to an optimality of a
corresponding confidence set, and therein lies the real utility of the close
relationship between hypothesis testing and confidence sets.

7 SEQUENTIAL TESTS OF HYPOTHESES 
7.1 Introduction 

Sequential analysis refers to techniques for testing hypotheses or estimating
parameters when the sample size is not fixed in advance but is determined during
the course of the experiment by criteria which depend on the observations as they
occur. In this section we propose to consider, and then only briefly, one form
of sequential analysis, namely, the sequential probability ratio test.

In Sec. 2 above we considered testing the simple null hypothesis $\mathcal{H}_0$: $\theta = \theta_0$
versus the simple alternative hypothesis $\mathcal{H}_1$: $\theta = \theta_1$. It was shown (Neyman-Pearson
lemma) that for samples of fixed size $n$, the test which minimized the
size, say $\beta$, of the Type II error for fixed size, say $\alpha$, of the Type I error was a
simple likelihood-ratio test. That is, for fixed $n$ and $\alpha$, $\beta$ was minimized.
Suppose now that it is desired to fix both $\alpha$ and $\beta$ in advance and then find that
simple likelihood-ratio test having minimum sample size $n$ and having size of
Type I error equal to $\alpha$ and size of Type II error equal to $\beta$. The solution of such a
problem is illustrated in the following example.
EXAMPLE 26 A manufacturer of a certain component, say, an oil seal, knows
the history of his current manufacturing process. He knows, for instance,
that the distribution of lifetimes of the seals now being manufactured is,
say, $N(100, 100)$. A new manufacturing process is suggested; the manufacturer
wants to continue with his present manufacturing process if the new
process is not better (longer mean lifetime), yet he also wants to be quite
certain to switch to the new process if the new process increases the mean
lifetime by, say, 5 percent. He proposes to take a sample of observations
of lifetimes of seals made by the new process and then from the sample
decide whether or not the process has longer mean life. He models the
experiment by assuming that the random variable $X$, representing the lifetime
of a seal manufactured using the new process, is distributed as
$N(\theta, 100)$, and he wants to test $\mathcal{H}_0$: $\theta \le 100$ versus $\mathcal{H}_1$: $\theta > 100$. He
fixes his error sizes and wants to determine the sample size $n$ so that, say,

\[
.01 = \alpha = P_{\theta = 100}[\text{reject } \mathcal{H}_0] \quad \text{and} \quad
.05 = \beta = P_{\theta = 105}[\text{accept } \mathcal{H}_0].
\]

That is, he seeks to determine $n$ so that there is only a 1 percent chance of
rejecting that the new process is no better than the old when it is not, yet
there is a 95 percent chance of rejecting that the mean lifetime of the new
process is less than 100 when in fact it is 5 percent larger. It can be shown
that the simple likelihood-ratio test is equivalent to the test of rejecting $\mathcal{H}_0$
for large $\bar{X}_n$. Thus he seeks to determine $n$ and $k$ so that

\[
.01 = P_{\theta = 100}[\bar{X}_n > k] \quad \text{and} \quad .05 = P_{\theta = 105}[\bar{X}_n \le k].
\]

Since $\bar{X}_n$ has standard deviation $10/\sqrt{n}$, this implies

\[
\frac{k - 100}{10/\sqrt{n}} = 2.326 \quad \text{and} \quad \frac{k - 105}{10/\sqrt{n}} = -1.645,
\]

which together imply that $100 + 10(2.326)/\sqrt{n} = 105 - 10(1.645)/\sqrt{n}$, or
$n \approx 63.08$, so a sample of size 64 is needed. ////
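
The sample-size calculation of Example 26 amounts to solving two linear equations in $k$ and $1/\sqrt{n}$. The following sketch carries it out using the standard normal quantile function from Python's standard library; the variable names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

sigma = 10.0                    # standard deviation of a single lifetime
theta0, theta1 = 100.0, 105.0   # means under the null and the alternative
alpha, beta = 0.01, 0.05        # sizes of the Type I and Type II errors

z_alpha = NormalDist().inv_cdf(1 - alpha)   # about 2.326
z_beta = NormalDist().inv_cdf(1 - beta)     # about 1.645

# Solve theta0 + sigma * z_alpha / sqrt(n) = theta1 - sigma * z_beta / sqrt(n).
n_exact = (sigma * (z_alpha + z_beta) / (theta1 - theta0)) ** 2
n = ceil(n_exact)                            # smallest adequate integer sample size
k = theta0 + sigma * z_alpha / sqrt(n_exact) # cutoff for the sample mean

print(round(n_exact, 2), n, round(k, 2))
```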


Referring to the above example, the following considerations make
sequential analysis interesting both from the theoretical and practical viewpoint.
In drawing the 64 observations to test $\mathcal{H}_0$, it is possible that among the first
few observations, say 20, 30, or 40, the evidence is quite sufficient relative to $\alpha$
and $\beta$ for accepting or rejecting $\mathcal{H}_0$, and then observing additional observations
would be a waste of time and effort. In other words, the possibility is
raised that by constructing the test in a fashion which permits termination of
the sampling at any observation, one can test $\mathcal{H}_0$ with fixed error sizes $\alpha$ and $\beta$
and yet do so with fewer than 64 observations on an average. This is in fact
the case, although it may at first appear surprising in view of the fact that the
best test for fixed sample size requires 64 observations. The saving in observations
is often quite large, sometimes as much as 50 percent! We will study such
a sequential procedure in the remaining subsections.


7.2 Definition of Sequential Probability Ratio Test

Consider testing a simple null hypothesis against a simple alternative hypothesis.
In other words, suppose a sample can be drawn from one of two distributions
(it is not known which one) and it is desired to test that the sample came from
one distribution against the possibility that it came from the other. If $X_1,
X_2, \ldots$ denotes the random variables, we want to test $\mathcal{H}_0$: $X_i \sim f_0(\cdot)$ versus
$\mathcal{H}_1$: $X_i \sim f_1(\cdot)$. The simple likelihood-ratio test was of the following form:

\[
\text{Reject } \mathcal{H}_0 \text{ if } \lambda = \frac{L_0(x_1, \ldots, x_n)}{L_1(x_1, \ldots, x_n)} \le k
\text{ for some constant } k > 0.
\]

The sequential test that we propose to consider employs the likelihood ratios
sequentially. Define

\[
\lambda_m = \lambda_m(x_1, \ldots, x_m) = \frac{L_0(x_1, \ldots, x_m)}{L_1(x_1, \ldots, x_m)}
= \frac{L_0(m)}{L_1(m)} = \frac{\prod_{i=1}^{m} f_0(x_i)}{\prod_{i=1}^{m} f_1(x_i)}
\]

for $m = 1, 2, \ldots$, and compute sequentially $\lambda_1, \lambda_2, \ldots$. For fixed $k_0$ and $k_1$
satisfying $0 < k_0 < k_1$, adopt the following procedure: Take observation $x_1$ and
compute $\lambda_1$; if $\lambda_1 \le k_0$, reject $\mathcal{H}_0$; if $\lambda_1 \ge k_1$, accept $\mathcal{H}_0$; and if $k_0 < \lambda_1 < k_1$,
take observation $x_2$, and compute $\lambda_2$. If $\lambda_2 \le k_0$, reject $\mathcal{H}_0$; if $\lambda_2 \ge k_1$,
accept $\mathcal{H}_0$; and if $k_0 < \lambda_2 < k_1$, observe $x_3$, etc. The idea is to continue
sampling as long as $k_0 < \lambda_j < k_1$ and stop as soon as $\lambda_m \le k_0$ or $\lambda_m \ge k_1$,
rejecting $\mathcal{H}_0$ if $\lambda_m \le k_0$ and accepting $\mathcal{H}_0$ if $\lambda_m \ge k_1$. The critical region of
the described sequential test can be defined as $C = \bigcup_{n=1}^{\infty} C_n$, where

\[
C_n = \{(x_1, \ldots, x_n): k_0 < \lambda_j(x_1, \ldots, x_j) < k_1, \; j = 1, \ldots, n - 1, \;
\lambda_n(x_1, \ldots, x_n) \le k_0\}. \tag{58}
\]

A point in $C_n$ indicates that $\mathcal{H}_0$ is to be rejected for a sample of size $n$.
Similarly, the acceptance region can be defined as $A = \bigcup_{n=1}^{\infty} A_n$, where

\[
A_n = \{(x_1, \ldots, x_n): k_0 < \lambda_j(x_1, \ldots, x_j) < k_1, \; j = 1, \ldots, n - 1, \;
\lambda_n(x_1, \ldots, x_n) \ge k_1\}. \tag{59}
\]

Definition 20 Sequential probability ratio test For fixed $0 < k_0 < k_1$, a test
as described above is defined to be a sequential probability ratio test. ////
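The stopping rule just defined can be sketched in a few lines. This is only an illustration under stated assumptions: the names `sprt` and `normal_log_density`, the thresholds `k0 = 0.01` and `k1 = 20.0`, and the two artificial data sets are hypothetical and do not come from the text. The sketch works with $\log_e \lambda_m$ rather than $\lambda_m$ to avoid numerical underflow in long products.

```python
import math


def sprt(observations, log_f0, log_f1, k0, k1):
    """Sequential probability ratio test of f0 (null) against f1 (alternative).

    Returns (decision, n): decision is "reject" when lambda_n <= k0,
    "accept" when lambda_n >= k1, and "continue" if the data run out first.
    """
    log_k0, log_k1 = math.log(k0), math.log(k1)
    log_lambda = 0.0
    n = 0
    for n, x in enumerate(observations, start=1):
        log_lambda += log_f0(x) - log_f1(x)  # log of L0(n) / L1(n)
        if log_lambda <= log_k0:
            return "reject", n
        if log_lambda >= log_k1:
            return "accept", n
    return "continue", n


def normal_log_density(theta, sigma):
    """Log density of N(theta, sigma^2), returned as a function of x."""
    def log_f(x):
        return -0.5 * ((x - theta) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
    return log_f


# Hypothetical illustration: null mean 100, alternative mean 105, sigma = 10.
log_f0 = normal_log_density(100.0, 10.0)
log_f1 = normal_log_density(105.0, 10.0)
print(sprt([120.0] * 10, log_f0, log_f1, k0=0.01, k1=20.0))  # ('reject', 6)
print(sprt([90.0] * 10, log_f0, log_f1, k0=0.01, k1=20.0))   # ('accept', 5)
```

Each observation of 120 adds $-0.875$ to $\log_e \lambda$ and each observation of 90 adds $0.625$, which is why the two runs stop at the sixth and fifth observation respectively.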

When we considered the simple likelihood-ratio test for fixed sample size
$n$, we determined $k$ so that the test would have preassigned size $\alpha$. We now
want to determine $k_0$ and $k_1$ so that the sequential probability ratio test will have
preassigned $\alpha$ and $\beta$ for its respective sizes of the Type I and Type II errors.
Note that

$$\alpha = P[\text{reject } \mathscr{H}_0 \mid \mathscr{H}_0 \text{ is true}] = \sum_{n=1}^{\infty} \int_{C_n} L_0(n) \tag{60}$$

and

$$\beta = P[\text{accept } \mathscr{H}_0 \mid \mathscr{H}_0 \text{ is false}] = \sum_{n=1}^{\infty} \int_{A_n} L_1(n), \tag{61}$$

where, as before, $\int_{C_n} L_0(n)$ is a shortened notation for $\int \cdots \int_{C_n} \prod_{i=1}^{n} f_0(x_i)\, dx_i$.

For fixed $\alpha$ and $\beta$, Eqs. (60) and (61) are two equations in the two unknowns $k_0$
and $k_1$. (Both $A_n$ and $C_n$ are defined in terms of $k_0$ and $k_1$.) A solution of
these two equations would give the sequential probability ratio test having the
desired preassigned error sizes $\alpha$ and $\beta$. As might be anticipated, the actual
determination of $k_0$ and $k_1$ from Eqs. (60) and (61) can be a major computational
project. In practice, they are seldom determined that way because a very simple
and accurate approximation is available and is given in the next subsection.
We note that the sample size of a sequential probability ratio test is a
random variable. The procedure says to continue sampling until $\lambda_n =
\lambda_n(x_1, \ldots, x_n)$ first falls outside the interval $(k_0, k_1)$. The actual sample size then
depends on which $x_i$'s are observed; it is a function of the random variables
$X_1, X_2, \ldots$ and consequently is itself a random variable. Denote it by $N$.
Ideally we would like to know the distribution of $N$ or at least the expectation
of $N$. (The procedure, as defined, seemingly allows for the sampling to continue
indefinitely, meaning that $N$ could be infinite. Although we will not so prove,
it can be shown that $N$ is finite with probability 1.) One way of assessing the
performance of the sequential probability ratio test would be to evaluate the
expected sample size that is required under each hypothesis. The following
theorem, given without proof (see Lehmann [16]), states that the sequential
probability ratio test is an optimal test if performance is measured using expected
sample size.

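Theorem 10 The sequential probability ratio test with error sizes $\alpha$ and
$\beta$ minimizes both $\mathscr{E}[N \mid \mathscr{H}_0 \text{ is true}]$ and $\mathscr{E}[N \mid \mathscr{H}_1 \text{ is true}]$ among all tests
(sequential or not) which satisfy the following: $P[\mathscr{H}_0 \text{ is rejected} \mid \mathscr{H}_0 \text{ is
true}] \le \alpha$, $P[\mathscr{H}_0 \text{ is accepted} \mid \mathscr{H}_0 \text{ is false}] \le \beta$, and the expected sample
size is finite. ////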

Note that in particular the sequential probability ratio test requires fewer
observations on the average than does the fixed sample-size test that has the
same error sizes. In Subsec. 7.4 we will evaluate the expected sample size for
the example given in the introduction in which 64 observations were required
for a fixed sample-size test with preassigned $\alpha$ and $\beta$.


7.3 Approximate Sequential Probability Ratio Test 
We noted above that the determination of $k_0$ and $k_1$ that defines that particular
sequential probability ratio test which has error sizes $\alpha$ and $\beta$ is in general
computationally quite difficult. The following remark gives an approximation
to $k_0$ and $k_1$.


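Remark Let $k_0$ and $k_1$ be defined so that the sequential probability
ratio test corresponding to $k_0$ and $k_1$ has error sizes $\alpha$ and $\beta$; then $k_0$ and
$k_1$ can be approximated by, say, $k_0'$ and $k_1'$, where

$$k_0' = \frac{\alpha}{1-\beta} \quad\text{and}\quad k_1' = \frac{1-\alpha}{\beta}. \tag{62}$$

Proof (Assume $\sum_{n=1}^{\infty} P[N = n \mid \mathscr{H}_i] = 1$ for $i = 0, 1$.) On $C_n$ we have
$\lambda_n \le k_0$, that is, $L_0(n) \le k_0 L_1(n)$, so

$$\alpha = P[\text{reject } \mathscr{H}_0 \mid \mathscr{H}_0 \text{ is true}] = \sum_{n=1}^{\infty} \int_{C_n} L_0(n) \le \sum_{n=1}^{\infty} \int_{C_n} k_0 L_1(n)$$

$$= k_0 \sum_{n=1}^{\infty} \int_{C_n} L_1(n) = k_0 P[\text{reject } \mathscr{H}_0 \mid \mathscr{H}_1 \text{ is true}] = k_0(1-\beta),$$

and hence $k_0 \ge \alpha/(1-\beta)$. Also, since $L_0(n) \ge k_1 L_1(n)$ on $A_n$,

$$1 - \alpha = P[\text{accept } \mathscr{H}_0 \mid \mathscr{H}_0 \text{ is true}] = \sum_{n=1}^{\infty} \int_{A_n} L_0(n) \ge \sum_{n=1}^{\infty} \int_{A_n} k_1 L_1(n)$$

$$= k_1 P[\text{accept } \mathscr{H}_0 \mid \mathscr{H}_1 \text{ is true}] = k_1 \beta,$$

and hence $k_1 \le (1-\alpha)/\beta$. Note that the approximations $k_0' = \alpha/(1-\beta)$
and $k_1' = (1-\alpha)/\beta$ satisfy

$$k_0' = \frac{\alpha}{1-\beta} \le k_0 < k_1 \le \frac{1-\alpha}{\beta} = k_1'. \tag{63}$$

////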

Remark Let $\alpha'$ and $\beta'$ be the error sizes of the sequential probability
ratio test defined by $k_0'$ and $k_1'$ given in Eq. (62). Then $\alpha' + \beta' \le \alpha + \beta$.

Proof Let $A'$ and $C'$ (with corresponding $A_n'$ and $C_n'$) denote the
acceptance and critical regions of the sequential probability ratio test
defined by $k_0'$ and $k_1'$. Then

$$\alpha' = \sum_{n=1}^{\infty} \int_{C_n'} L_0(n) \le \frac{\alpha}{1-\beta} \sum_{n=1}^{\infty} \int_{C_n'} L_1(n) = \frac{\alpha}{1-\beta}(1-\beta')$$

and

$$1 - \alpha' = \sum_{n=1}^{\infty} \int_{A_n'} L_0(n) \ge \frac{1-\alpha}{\beta} \sum_{n=1}^{\infty} \int_{A_n'} L_1(n) = \frac{1-\alpha}{\beta}\,\beta';$$

hence $\alpha'(1-\beta) \le \alpha(1-\beta')$ and $(1-\alpha)\beta' \le (1-\alpha')\beta$, which together
imply that $\alpha'(1-\beta) + (1-\alpha)\beta' \le \alpha(1-\beta') + (1-\alpha')\beta$, or $\alpha' + \beta' \le \alpha + \beta$. ////

Naturally, one would prefer to use that sequential probability ratio test
having the desired preassigned error sizes $\alpha$ and $\beta$; however, since it is difficult
to find the $k_0$ and $k_1$ corresponding to such a sequential probability ratio test,
one can instead use the sequential probability ratio test defined by $k_0'$ and $k_1'$ of
Eq. (62) and be assured that the sum of the error sizes $\alpha'$ and $\beta'$ is less than or
equal to the sum of the desired error sizes $\alpha$ and $\beta$.
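The bound $\alpha' + \beta' \le \alpha + \beta$ can be checked numerically in a case where the error sizes are computable. The sketch below is an illustration only: the Bernoulli hypotheses ($p_0 = .5$ against $p_1 = .7$), the function names, and the truncation point `n_max` are assumptions introduced here, not part of the text. It propagates, observation by observation, the probability of still sampling with a given number of successes, and accumulates the probability mass that leaves through each boundary.

```python
import math


def wald_thresholds(alpha, beta):
    """Approximate thresholds k0' and k1' of Eq. (62)."""
    return alpha / (1 - beta), (1 - alpha) / beta


def bernoulli_sprt_errors(p0, p1, k0, k1, n_max=2000):
    """Error sizes of the SPRT of p = p0 (null) against p = p1 for Bernoulli data.

    Sampling is truncated at n_max; the probability of still sampling at that
    point is ignored, so both error sizes are very slightly underestimated.
    """
    log_k0, log_k1 = math.log(k0), math.log(k1)
    z_success = math.log(p0 / p1)
    z_failure = math.log((1 - p0) / (1 - p1))
    # successes so far -> (probability under p0, probability under p1)
    # of reaching that count without having stopped earlier
    states = {0: (1.0, 1.0)}
    alpha_hat = beta_hat = 0.0
    for n in range(1, n_max + 1):
        reached = {}
        for s, (q0, q1) in states.items():
            for x in (0, 1):
                w0 = q0 * (p0 if x else 1 - p0)
                w1 = q1 * (p1 if x else 1 - p1)
                a, b = reached.get(s + x, (0.0, 0.0))
                reached[s + x] = (a + w0, b + w1)
        states = {}
        for s, (q0, q1) in reached.items():
            log_lambda = s * z_success + (n - s) * z_failure
            if log_lambda <= log_k0:
                alpha_hat += q0      # null rejected although it is true
            elif log_lambda >= log_k1:
                beta_hat += q1       # null accepted although it is false
            else:
                states[s] = (q0, q1)
        if not states:
            break
    return alpha_hat, beta_hat


k0_approx, k1_approx = wald_thresholds(alpha=0.01, beta=0.05)
alpha_prime, beta_prime = bernoulli_sprt_errors(0.5, 0.7, k0_approx, k1_approx)
print(alpha_prime + beta_prime <= 0.01 + 0.05)  # True
```

Because $\lambda_n$ jumps past a boundary rather than landing exactly on it, the attained error sizes are typically somewhat smaller than the nominal ones.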

7.4 Approximate Expected Sample Size of 
Sequential Probability Ratio Test 

The procedure used in performing a sequential probability ratio test is to continue
sampling as long as $k_0 < \lambda_m < k_1$ and stop sampling as soon as $\lambda_m \le k_0$ or
$\lambda_m \ge k_1$. If $z_i = \log_e [f_0(x_i)/f_1(x_i)]$, an equivalent test is given by the following:

Continue sampling as long as $\log_e k_0 < \sum_{i=1}^{m} z_i < \log_e k_1$, and stop sampling as soon
as $\sum_{i=1}^{m} z_i \le \log_e k_0$ (and then reject $\mathscr{H}_0$) or $\sum_{i=1}^{m} z_i \ge \log_e k_1$ (and then accept $\mathscr{H}_0$).

As before, let $N$ be the random variable denoting the sample size of the sequential
probability ratio test, and let $Z_i = \log_e [f_0(X_i)/f_1(X_i)]$. Equation (64), given in
the following theorem, is useful in finding an approximate expected sample size
of the sequential probability ratio test.

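Theorem 11 Wald's equation Let $Z_1, Z_2, \ldots, Z_n, \ldots$ be independent
identically distributed random variables satisfying $\mathscr{E}[|Z_i|] < \infty$. Let $N$
be an integer-valued random variable whose value $n$ depends only on the
values of the first $n$ $Z_i$'s. Suppose $\mathscr{E}[N] < \infty$. Then

$$\mathscr{E}[Z_1 + \cdots + Z_N] = \mathscr{E}[N] \cdot \mathscr{E}[Z_i]. \tag{64}$$

Proof
$$\mathscr{E}[Z_1 + \cdots + Z_N] = \mathscr{E}\bigl[\mathscr{E}[Z_1 + \cdots + Z_N \mid N]\bigr]$$
$$= \sum_{n=1}^{\infty} \mathscr{E}[Z_1 + \cdots + Z_n \mid N = n]\, P[N = n]$$
$$= \sum_{n=1}^{\infty} \sum_{i=1}^{n} \mathscr{E}[Z_i \mid N = n]\, P[N = n]$$
$$= \sum_{i=1}^{\infty} \sum_{n=i}^{\infty} \mathscr{E}[Z_i \mid N = n]\, P[N = n]$$
$$= \sum_{i=1}^{\infty} \mathscr{E}[Z_i \mid N \ge i]\, P[N \ge i]$$
$$= \mathscr{E}[Z_i] \sum_{i=1}^{\infty} P[N \ge i]$$
$$= \mathscr{E}[Z_i]\, \mathscr{E}[N].$$

($\mathscr{E}[Z_i] = \mathscr{E}[Z_i \mid N \ge i]$ since the event $\{N \ge i\}$ depends only on $Z_1, \ldots, Z_{i-1}$
and hence is independent of $Z_i$. Also $\mathscr{E}[N] = \sum_{i=1}^{\infty} P[N \ge i]$ follows from
Eq. (6) of Chap. II.) ////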
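Wald's equation can be verified exactly in a small case. The following sketch is an illustration under assumptions made here and not in the text: the $Z_i$ take the values $+1$ and $-1$ with probabilities $p$ and $1 - p$, and $N$ is the first $n$ at which $|Z_1 + \cdots + Z_n|$ reaches a barrier, capped at `n_max` so that $N$ is bounded. The function name and parameters are hypothetical.

```python
def wald_check(p, barrier=3, n_max=200):
    """Return (E[Z_1 + ... + Z_N], E[N] * E[Z_i]) for a +1/-1 random walk.

    N is the first n with |Z_1 + ... + Z_n| >= barrier, or n_max if the
    barrier has not been reached by then. Both expectations are computed
    exactly (up to floating-point rounding) by propagating probabilities.
    """
    mean_z = p - (1 - p)
    alive = {0: 1.0}          # partial sum -> probability of being there, not yet stopped
    e_sum = 0.0               # accumulates E[Z_1 + ... + Z_N]
    e_n = 0.0                 # accumulates E[N]
    for n in range(1, n_max + 1):
        reached = {}
        for s, q in alive.items():
            for step, w in ((1, p), (-1, 1 - p)):
                reached[s + step] = reached.get(s + step, 0.0) + q * w
        alive = {}
        for s, q in reached.items():
            if abs(s) >= barrier or n == n_max:
                e_sum += s * q
                e_n += n * q
            else:
                alive[s] = q
    return e_sum, e_n * mean_z


lhs, rhs = wald_check(p=0.6)
print(abs(lhs - rhs) < 1e-9)  # True
```

The stopping rule looks only at the partial sums observed so far, which is the condition the theorem places on $N$.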


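If the sequential probability ratio test leads to rejection of $\mathscr{H}_0$, then the
random variable $Z_1 + \cdots + Z_N \le \log_e k_0$, but $Z_1 + \cdots + Z_N$ is close to $\log_e k_0$
since $Z_1 + \cdots + Z_N$ first became less than or equal to $\log_e k_0$ at the $N$th observation;
hence, given rejection, $\mathscr{E}[Z_1 + \cdots + Z_N] \approx \log_e k_0$. Similarly, if the test leads to acceptance,
$\mathscr{E}[Z_1 + \cdots + Z_N] \approx \log_e k_1$; hence $\mathscr{E}[Z_1 + \cdots + Z_N] \approx \rho \log_e k_0 + (1 - \rho) \log_e k_1$,
where $\rho = P[\mathscr{H}_0 \text{ is rejected}]$. Using

$$\mathscr{E}[N] = \frac{\mathscr{E}[Z_1 + \cdots + Z_N]}{\mathscr{E}[Z_i]} \approx \frac{\rho \log_e k_0 + (1-\rho) \log_e k_1}{\mathscr{E}[Z_i]},$$

we obtain

$$\mathscr{E}[N \mid \mathscr{H}_0 \text{ is true}] \approx \frac{\alpha \log_e k_0 + (1-\alpha) \log_e k_1}{\mathscr{E}[Z_i \mid \mathscr{H}_0 \text{ is true}]} \approx \frac{\alpha \log_e [\alpha/(1-\beta)] + (1-\alpha) \log_e [(1-\alpha)/\beta]}{\mathscr{E}[Z_i \mid \mathscr{H}_0 \text{ is true}]} \tag{65}$$

and

$$\mathscr{E}[N \mid \mathscr{H}_0 \text{ is false}] \approx \frac{(1-\beta) \log_e k_0 + \beta \log_e k_1}{\mathscr{E}[Z_i \mid \mathscr{H}_0 \text{ is false}]} \approx \frac{(1-\beta) \log_e [\alpha/(1-\beta)] + \beta \log_e [(1-\alpha)/\beta]}{\mathscr{E}[Z_i \mid \mathscr{H}_0 \text{ is false}]}. \tag{66}$$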


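EXAMPLE 27 Consider sampling from $N(\theta, \sigma^2)$, where $\sigma^2$ is assumed known.
Test $\mathscr{H}_0\colon \theta = \theta_0$ versus $\mathscr{H}_1\colon \theta = \theta_1$. Now

$$z_i = \log_e \frac{f_0(x_i)}{f_1(x_i)} = \log_e \frac{(1/\sqrt{2\pi}\,\sigma)\, e^{-\frac12 [(x_i - \theta_0)/\sigma]^2}}{(1/\sqrt{2\pi}\,\sigma)\, e^{-\frac12 [(x_i - \theta_1)/\sigma]^2}}$$

$$= -\frac{1}{2\sigma^2}\bigl[(x_i - \theta_0)^2 - (x_i - \theta_1)^2\bigr] = -\frac{1}{2\sigma^2}\bigl[(\theta_0^2 - \theta_1^2) - 2x_i(\theta_0 - \theta_1)\bigr];$$

hence

$$\mathscr{E}[Z_i \mid \mathscr{H}_0 \text{ is true}] = -\frac{1}{2\sigma^2}\bigl[(\theta_0^2 - \theta_1^2) - 2\theta_0(\theta_0 - \theta_1)\bigr] = \frac{1}{2\sigma^2}(\theta_1 - \theta_0)^2$$

and

$$\mathscr{E}[Z_i \mid \mathscr{H}_0 \text{ is false}] = -\frac{1}{2\sigma^2}(\theta_1 - \theta_0)^2.$$

For $\alpha = .01$, $\beta = .05$, $\sigma^2 = 100$, $\theta_0 = 100$, and $\theta_1 = 105$ (as in Example 26),
Eq. (65) reduces to

$$\mathscr{E}[N \mid \mathscr{H}_0 \text{ is true}] \approx \frac{.01 \log_e (.01/.95) + .99 \log_e (.99/.05)}{25/200} \approx 24,$$

and Eq. (66) reduces to

$$\mathscr{E}[N \mid \mathscr{H}_0 \text{ is false}] \approx \frac{.95 \log_e (.01/.95) + .05 \log_e (.99/.05)}{-25/200} \approx 34.$$

The average sample sizes of 24 and 34 for the sequential probability ratio
test compare to a sample size of 64 for the fixed sample-size test. ////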
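The two approximations in the example are easy to recompute. The sketch below evaluates Eqs. (65) and (66) for the normal case using the expressions for $\mathscr{E}[Z_i]$ derived above; the function name `approx_expected_n` and its parameter names are illustrative and are not from the text.

```python
import math


def approx_expected_n(alpha, beta, theta0, theta1, sigma2):
    """Approximate expected SPRT sample sizes from Eqs. (65) and (66)
    for testing theta = theta0 against theta = theta1 in N(theta, sigma2)."""
    log_k0 = math.log(alpha / (1 - beta))         # log of k0' from Eq. (62)
    log_k1 = math.log((1 - alpha) / beta)         # log of k1' from Eq. (62)
    ez_null = (theta1 - theta0) ** 2 / (2 * sigma2)   # E[Z_i | null true]
    ez_alt = -ez_null                                 # E[Z_i | null false]
    en_null = (alpha * log_k0 + (1 - alpha) * log_k1) / ez_null
    en_alt = ((1 - beta) * log_k0 + beta * log_k1) / ez_alt
    return en_null, en_alt


en_null, en_alt = approx_expected_n(0.01, 0.05, 100, 105, 100)
print(round(en_null, 2), round(en_alt, 2))    # 23.28 33.42
print(math.ceil(en_null), math.ceil(en_alt))  # 24 34
```

The unrounded values are about 23.3 and 33.4; the figures 24 and 34 quoted in the example correspond to rounding up to whole observations.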


PROBLEMS 

1 Let $X$ have a Bernoulli distribution, where $P[X = 1] = \theta = 1 - P[X = 0]$.

(a) For a random sample of size 10, test $\mathscr{H}_0\colon \theta \le \tfrac12$ versus $\mathscr{H}_1\colon \theta > \tfrac12$. Use
the critical region $\{\sum x_i \ge 6\}$.

(i) Find the power function, and sketch it.
(ii) What is the size of this test?

(b) For a random sample of size $n = 10$:

(i) Find the most powerful size-$\alpha$ ($\alpha = .0547$) test of $\mathscr{H}_0\colon \theta = \tfrac12$ versus
$\mathscr{H}_1\colon \theta = \tfrac14$.

(ii) Find the power of the most powerful test at $\theta = \tfrac14$.

(c) For a random sample of size 10, test $\mathscr{H}_0\colon \theta = \tfrac12$ versus $\mathscr{H}_1\colon \theta = \tfrac14$.

(i) Find the minimax test for the loss function $0 = \ell(d_0; \theta_0) = \ell(d_1; \theta_1)$,
$\ell(d_0; \theta_1) = 1719$, $\ell(d_1; \theta_0) = 2241$.

(ii) Compare the maximum risk of the minimax test with the maximum risk
of the most powerful test given in part (b).

(d) Again, for a sample of size 10, test $\mathscr{H}_0\colon \theta = \tfrac12$ versus $\mathscr{H}_1\colon \theta = \tfrac14$. Use the
above loss function to find the Bayes test corresponding to prior probabilities
given by

3 4 

5= (1719/2241)(i) l ' + 3 4 ' 
2 Let $X$ have the density $f(x; \theta) = \theta x^{\theta - 1} I_{(0,1)}(x)$.

(a) To test $\mathscr{H}_0\colon \theta \le 1$ versus $\mathscr{H}_1\colon \theta > 1$, a sample of size 2 was selected, and the
critical region $C = \{(x_1, x_2) : 3/(4x_1) \le x_2\}$ was used. Find the power function
and size of this test.

(b) For a random sample of size 2, find the most powerful size-[$\alpha = \tfrac12(1 - \ln 2)$]
test of $\mathscr{H}_0\colon \theta = 1$ versus $\mathscr{H}_1\colon \theta = 2$.

(c) Are the tests that you obtained in parts (a) and (b) unbiased?

(d) For a random sample of size 2, find the minimax test of $\mathscr{H}_0\colon \theta_0 = 1$ versus
$\mathscr{H}_1\colon \theta_1 = 2$ using the loss function $\ell(d_0; \theta_0) = \ell(d_1; \theta_1) = 0$, $\ell(d_0; \theta_1) =
1 - \log_e 2$, $\ell(d_1; \theta_0) = \tfrac12 + \log_e 2$.

(e) For a random sample of size n t find the Bayes test corresponding to prior 

probabilities given by g = f of 0 : 0 = 1 versus X i : 0 = 2 using the loss 
function ^(cf 0 ; 0o) = 00 = 0, £(d 0 ; 00 = 1, ; 0 O ) = 2. 

(f) Test $\mathscr{H}_0\colon \theta = 1$ versus $\mathscr{H}_1\colon \theta = 2$ using a sample of size 2. Let $\alpha$ = size of
Type I error and $\beta$ = size of Type II error. Find the test that minimizes the
largest of $\alpha$ and $\beta$.

3 Let $\Theta = \{1, 2\}$, and suppose you have one observation from the density
$f(x; \theta) = I_{(\theta, \theta+1)}(x)$. Show that a test that has uniformly smallest risk among all tests
exists, and find it.

4 Let $X$ be a single observation from the density

$$f(x; \theta) = \theta x^{\theta-1} I_{(0,1)}(x),$$

where $\theta > 0$.

(a) In testing ft ? 0 : 0 <! 1 versus 0 > 1, find the power function and size of 
the test given by the following: Reject Jifo if and only if X> 

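(b) Find a most powerful size-$\alpha$ test of $\mathscr{H}_0\colon \theta = 2$ versus $\mathscr{H}_1\colon \theta = 1$.

(c) For the loss function given by $\ell(d_0; 2) = \ell(d_1; 1) = 0$, $\ell(d_0; 1) = \ell(d_1; 2) = 1$,
find the minimax test of $\mathscr{H}_0\colon \theta = 2$ versus $\mathscr{H}_1\colon \theta = 1$.

(d) Is there a uniformly most powerful size-$\alpha$ test of $\mathscr{H}_0\colon \theta \ge 2$ versus $\mathscr{H}_1\colon \theta < 2$?
If so, what is it?

(e) Among all possible simple likelihood-ratio tests of $\mathscr{H}_0\colon \theta = 2$ versus
$\mathscr{H}_1\colon \theta = 1$, find that test that minimizes $\alpha + \beta$, where $\alpha$ and $\beta$ are the respective
sizes of the Type I and Type II errors.

(f) Find the generalized likelihood-ratio test of size $\alpha$ of $\mathscr{H}_0\colon \theta = 1$ versus
$\mathscr{H}_1\colon \theta \ne 1$.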

5 Let $X$ be an observation from the density $f(x; \theta) = (2\theta x + 1 - \theta) I_{[0,1]}(x)$,
where $-1 \le \theta \le 1$.

(a) Find the most powerful size-$\alpha$ test of $\mathscr{H}_0\colon \theta = 0$ versus $\mathscr{H}_1\colon \theta = 1$. (Your
test should be expressed in terms of $\alpha$.)

(b) To test Xe a. 0^0 versus /A: 0>O, the following procedure was used: 
Reject y/’o if ^ exceeds h Find the power and size of this test 

(c) Is there a uniformly most powerful size-$\alpha$ test of $\mathscr{H}_0\colon \theta \le 0$ versus $\mathscr{H}_1\colon \theta > 0$?
If so, what is it?

(d) What is the generalized likelihood-ratio test of $\mathscr{H}_0\colon \theta = 0$ versus $\mathscr{H}_1\colon \theta \ne 0$?

(e) Among all possible simple likelihood-ratio tests of $\mathscr{H}_0\colon \theta = 0$ versus $\mathscr{H}_1\colon
\theta = 1$, find that test which minimizes $\alpha + \beta$, where $\alpha$ and $\beta$ are the respective
sizes of the Type I and Type II errors.

(f) Given a set of observations, all of which fall between 0 and 1, indicate how
you would test the hypothesis that the observations came from the density
$f(x; \theta)$.

6 Let Xi, , X. denote a random sample from/(x, ff) = (I/0)/ to *(x) and let 
Yu , y. be the corresponding ordered sample. To test d »0 O versus 

B f-8o, the following test was used Accept if 
otherwise reject 

(a) Find the power function for this test, and sketch it.
(b) Find another (nonrandomized) test that has the same size as the given test,
and show that the given test is more powerful (for all alternative $\theta$) than the
test you found.

7 LetA'i, , X. denote a random sample from 

f{x,8)={\lff)x"-'‘ , I i * u (x) 

Test $\mathscr{H}_0\colon \theta \le \theta_0$ versus $\mathscr{H}_1\colon \theta > \theta_0$.

(a) For a sample of size $n$, find a uniformly most powerful (UMP) size-$\alpha$ test if
such exists.

(b) Take $n = 2$, $\theta_0 = 1$, and $\alpha = .05$, and sketch the power function of the UMP
test.

8 Let $X_1, \ldots, X_n$ be a random sample from the Poisson distribution

$$f(x; \theta) = \frac{e^{-\theta}\theta^x}{x!}\, I_{\{0, 1, \ldots\}}(x).$$

(a) Find the UMP test of $\mathscr{H}_0\colon \theta = \theta_0$ versus $\mathscr{H}_1\colon \theta > \theta_0$, and sketch the power
function for $\theta_0 = 1$ and $n = 25$. (Use the central limit theorem. Pick
$\alpha = .05$.)

(b) Test $\mathscr{H}_0\colon \theta = \theta_0$ versus $\mathscr{H}_1\colon \theta \ne \theta_0$. Find the general form of the critical
region corresponding to the test arrived at using the generalized likelihood-ratio
principle. (The critical region should be defined in terms of $\sum X_i$.)

(c) A reasonable test of $\mathscr{H}_0\colon \theta = \theta_0$ versus $\mathscr{H}_1\colon \theta \ne \theta_0$ would be the following:
Reject if $|\bar{X} - \theta_0| \ge K$. For $\alpha = .05$, find $K$ so that $P[\text{reject } \mathscr{H}_0 \mid \mathscr{H}_0] = .05$.
(Assume that $n$ is large enough so that the central limit theorem can be used
to find an approximation to $K$.)

9 Let $\Theta = \{\theta_0, \theta_1\}$. Show that any test arrived at using the generalized likelihood-ratio
principle is equivalent to a simple likelihood-ratio test.

10 To test $\mathscr{H}_0\colon \theta \le 1$ versus $\mathscr{H}_1\colon \theta > 1$ on the basis of two observations, say $X_1$ and
$X_2$, from the uniform distribution on $(0, \theta)$, the following test was used:

Reject $\mathscr{H}_0$ if $X_1 + X_2 \ge 1$.

(a) Find the power function of the above test, and note its size. [Recall that
$X_1 + X_2$ has a triangular distribution on $(0, 2\theta)$.]

(b) Find another test that has the same size as the given test but has greater power
for some $\theta > 1$ if such exists. If such does not exist, explain why.

11 Let $X_1, \ldots, X_n$ be a random sample of size $n$ from $f(x; \theta) = \theta^2 x e^{-\theta x} I_{(0,\infty)}(x)$.

(a) In testing $\mathscr{H}_0\colon \theta \le 1$ versus $\mathscr{H}_1\colon \theta > 1$ for $n = 1$ (a sample of size 1) the
following test was used: Reject $\mathscr{H}_0$ if and only if $X_1 \le 1$. Find the power
function and size of this test.

(b) Find a most powerful size-$\alpha$ test of $\mathscr{H}_0\colon \theta = 1$ versus $\mathscr{H}_1\colon \theta = 2$.

(c) Does there exist a uniformly most powerful size-$\alpha$ test of $\mathscr{H}_0\colon \theta \le 1$ versus
$\mathscr{H}_1\colon \theta > 1$? If so, what is it?

(d) In testing $\mathscr{H}_0\colon \theta = 1$ versus $\mathscr{H}_1\colon \theta = 2$, among all simple likelihood-ratio
tests find that test which minimizes the sum of the sizes of the Type I and Type
II errors. You may take $n = 1$.

12 Let $X_1, \ldots, X_n$ be a random sample from the uniform distribution over the interval
$(\theta, \theta + 1)$. To test $\mathscr{H}_0\colon \theta = 0$ versus $\mathscr{H}_1\colon \theta > 0$, the following test was used:

Reject $\mathscr{H}_0$ if and only if $Y_n \ge 1$ or $Y_1 \ge k$, where $k$ is a constant.

(a) Determine $k$ so that the test will have size $\alpha$.

(b) Find the power function of the test you obtained in part (a).

(c) Prove or disprove: If $k$ is selected so that the test has size $\alpha$, then the given
test is uniformly most powerful of size $\alpha$.

13 Let $X_1, \ldots, X_m$ be a random sample from the density $\theta_1 x^{\theta_1 - 1} I_{(0,1)}(x)$, and let
$Y_1, \ldots, Y_n$ be a random sample from the density $\theta_2 y^{\theta_2 - 1} I_{(0,1)}(y)$. Assume that
the samples are independent. Set $U_i = -\log_e X_i$, $i = 1, \ldots, m$, and $V_j =
-\log_e Y_j$, $j = 1, \ldots, n$.

(a) Find the generalized likelihood-ratio for testing $\mathscr{H}_0\colon \theta_1 = \theta_2$ versus
$\mathscr{H}_1\colon \theta_1 \ne \theta_2$.

(b) Show that the generalized likelihood-ratio test can be expressed in terms of
the statistic

$$T = \frac{\sum U_i}{\sum U_i + \sum V_j}.$$

(c) If $\mathscr{H}_0$ is true, what is the distribution of $T$? (You do not have to derive it if
you know the answer.) Does the distribution of $T$ depend on $\theta = \theta_1 = \theta_2$
given that $\mathscr{H}_0$ is true?

14 Find a generalized likelihood-ratio test of size $\alpha$ for testing $\mathscr{H}_0\colon \theta \le 1$ versus
$\mathscr{H}_1\colon \theta > 1$ on the basis of a random sample $X_1, \ldots, X_n$ from $f(x; \theta) =
\theta e^{-\theta x} I_{(0,\infty)}(x)$.

15 Let $X$ be a single observation from the density $f(x; \theta) = (1 + \theta)x^{\theta} I_{(0,1)}(x)$, where
$\theta > -1$.

(a) Find the most powerful size-$\alpha$ test of $\mathscr{H}_0\colon \theta = 0$ versus $\mathscr{H}_1\colon \theta = 1$.

(b) Is there a uniformly most powerful size-$\alpha$ test of $\mathscr{H}_0\colon \theta \le 0$ versus $\mathscr{H}_1\colon \theta > 0$?
If so, what is it?

(c) Among all possible simple likelihood ratio tests of jP 0 8** 0 versus 

8 * 1 find a test which minimizes 2* + ft, where a and ft are the respective 
sizes of the Type I and Type II errors 

(d) Find a generalized likelihood-ratio test of $\mathscr{H}_0\colon \theta = 0$ versus $\mathscr{H}_1\colon \theta \ne 0$.

16 Let Xu , Jf* be a random sample from .jW, and let Y u ,,y,bea 

random sample from 9, .Xy) Assume that the samples are independent. 

(а) Find the generalized likelihood ratio for testing jf 0 f?i ■= 8i versus 
Jfi e.rtdj 

(б) Show that the generalized likelihood ratio test can be expressed in terms of 
the statistic T<= 2 Mil. X, + 2 *))• Argue (or show) that the distribution 
of T does not depend on 8 = 6i = d 2 when 0 is true 

17 Use the confidence-interval technique to derive a test of $\mathscr{H}_0\colon \mu_1 = \mu_2$ versus
$\mathscr{H}_1\colon \mu_1 \ne \mu_2$ in sampling from the bivariate normal distribution. Such a test is
often called a paired $t$ test. (See the last paragraph in Subsec. 3.4 in Chap. VIII.)

18 Given the sample $(-.2, -.9, -.6, .1)$ from a normal population with unit variance,
test whether the population mean is less than 0 at the .05 level (i.e., with probability
.05 of a Type I error). That is, test $\mathscr{H}_0\colon \mu \le 0$ at the .05 level versus $\mathscr{H}_1\colon \mu > 0$.

19 Given the sample $(-4.4, 4.0, 2.0, -4.8)$ from a normal population with variance 4
and the sample $(6.0, 1.0, 3.2, -.4)$ from a normal population with variance 5, test
at the .05 level that the means differ by no more than one unit. Plot the power
function for this test. Plot the ideal power function.

20 A metallurgist made four determinations of the melting point of manganese:
1269, 1271, 1263, and 1265 degrees centigrade. Test the hypothesis that the mean
$\mu$ of this population is within 5 degrees centigrade of the published value of 1260.
Use $\alpha = .05$. (Assume normality and $\sigma^2 = 5$.)

21 Plot the power function for a test of the null hypothesis $\mathscr{H}_0\colon -1 < \mu < 1$ for a
normal distribution with known variance using sample sizes 1, 4, 16, and 64. (Use
the standard deviation $\sigma$ as the unit of measurement on the $\mu$ axis and .05 probability
of Type I error.) Plot the ideal power function.

22 Let $X_1, \ldots, X_n$ be a random sample of size $n$ from a normal density with known
variance. What is the best critical region for testing the null hypothesis that the
mean is 6 against the alternative that the mean is 4?

23 Derive a test of $\mathscr{H}_0\colon \sigma^2 < 10$ against $\mathscr{H}_1\colon \sigma^2 \ge 10$ for a sample of size $n$ from a normal
population with a mean of 0.

24 In testing between two values $\mu_0$ and $\mu_1$ for the mean of a normal population,
show that the probabilities for both types of error can be made arbitrarily small
by taking a sufficiently large sample.

25 A cigarette manufacturer sent each of two laboratories presumably identical
samples of tobacco. Each made five determinations of the nicotine content in
milligrams as follows: (i) 24, 27, 26, 21, and 24 and (ii) 27, 28, 23, 31, and 26.
Were the two laboratories measuring the same thing? (Assume normality and a
common variance.)

26 The metallurgist of Prob. 20, after assessing the magnitude of the various errors
that might accrue in his experimental technique, decided that his measurements
should have a standard deviation of 2 degrees centigrade or less. Are the data
consistent with this supposition at the .05 level? (That is, test $\mathscr{H}_0\colon \sigma \le 2$.)

27 Test the hypothesis that the two samples of Prob. 19 came from populations with
the same variance. Use $\alpha = .05$.

28 The power function for a test that the means of two normal populations are equal
depends on the values of the two means $\mu_1$ and $\mu_2$ and is therefore a surface. But
the value of the function depends only on the difference $\theta = \mu_1 - \mu_2$, so that it
can be adequately represented by a curve, say $\beta(\theta)$. Plot $\beta(\theta)$ when samples of 4
are drawn from one population with variance 2 and samples of 2 are drawn from
another population with variance 3 for tests at the .01 level.

29 Given the samples (1.8, 2.9, 1.4, 1.1) and (5.0, 8.6, 9.2) from normal populations,
test whether the variances are equal at the .05 level.

30 Given a sample of size 100 with $\bar{X} = 2.7$ and $\sum (X_i - \bar{X})^2 = 225$, test the null
hypothesis $\mathscr{H}_0\colon \mu = 3$ and $\sigma^2 = 2.5$ at the .01 level, assuming that the population
is normal.

31 Using the sample of Prob. 30, test the hypothesis that $\mu = \sigma^2$ at the .01 level.

32 Using the sample of Prob. 30, test at the 0.1 level whether the .95 quantile point,
say $\xi = \xi_{.95}$, of the population distribution is 3 relative to alternatives $\xi < 3$.
Recall that $\xi$ is such that $\int_{-\infty}^{\xi} f(x)\, dx = .95$, where $f(x)$ is the population
density; it is, of course, $\mu + 1.645\sigma$ in the present instance where the distribution is
assumed to be normal.

33 A sample of size $n$ is drawn from each of $k$ normal populations with the same
variance. Derive the generalized likelihood-ratio test for testing the hypothesis
that the means are all 0. Show that the test is a function of a ratio which has
the $F$ distribution.

34 Derive the generalized likelihood-ratio test for testing whether the correlation of
a bivariate normal distribution is 0.

35 If $X_1, X_2, \ldots, X_n$ are observations from normal populations with known variances
$\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$, how would one test whether their means were all equal?

36 A newspaper in a certain city observed that driving conditions were much improved 
in the city because the number of fatal automobile accidents in the past year was 9 
whereas the average number per year over the past several years was 15. Is it 
possible that conditions were more hazardous than before? Assume that the 
number of accidents in a given year has a Poisson distribution. 

37 Six 1-foot specimens of insulated wire were tested at high voltage for weak spots 
in the insulation. The numbers of such weak spots were found to be 2, 0, 1, 1, 3, 
and 2. The manufacturer’s quality standard states that there are less than 120 
such defects per 100 feet. Is the batch from which these specimens were taken 
worse than the standard at the .05 level? (Use the Poisson distribution.)

38 Consider sampling from the normal distribution with unknown mean and variance.

(a) Find a generalized likelihood-ratio test of $\mathscr{H}_0\colon \sigma^2 \le \sigma_0^2$ versus $\mathscr{H}_1\colon \sigma^2 > \sigma_0^2$.

(b) Find a generalized likelihood-ratio test of $\mathscr{H}_0\colon \sigma^2 = \sigma_0^2$ versus $\mathscr{H}_1\colon \sigma^2 \ne \sigma_0^2$.

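39 (a) Suppose $(N_1, \ldots, N_k)$ is multinomially distributed with parameters $n$,
$p_1, \ldots, p_k$, where $N_{k+1} = n - N_1 - \cdots - N_k$ and $p_{k+1} = 1 - p_1 - \cdots - p_k$. Theorem
8 states that

$$Q_k = \sum_{j=1}^{k+1} \frac{(N_j - np_j)^2}{np_j}$$

has a limiting chi-square distribution. Find the exact mean and variance
of $Q_k$.

(b) Let $(N_1, \ldots, N_k)$ be distributed as in part (a). Define

$$Q_k^0 = \sum_{j=1}^{k+1} \frac{(N_j - np_j^0)^2}{np_j^0}.$$

[See Eq. (25).] Find $\mathscr{E}[Q_k^0]$. [See Eq. (26).] Is $\mathscr{E}[Q_k^0]$ for $p_1 = p_1^0, \ldots,
p_{k+1} = p_{k+1}^0$ less than or equal to $\mathscr{E}[Q_k^0]$ for arbitrary $p_1, \ldots, p_{k+1}$?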

40 A psychiatrist newly employed by a medical clinic remarked at a staff meeting that
about 40 percent of all chronic headaches were of the psychosomatic variety. His
disbelieving colleagues mixed some pills of plain flour and water, giving them to
all such patients on the clinic's rolls with the story that they were a new headache
remedy and asking for comments. When the comments were all in, they could be
fairly accurately classified as follows: (i) better than aspirin, 8; (ii) about the same
as aspirin, 3; (iii) slower than aspirin, 1; and (iv) worthless, 29. While the doctors
were somewhat surprised by these results, they nevertheless accused the psychiatrist
of exaggeration. Did they have good grounds?

41 A die was cast 300 times with the following results:

Occurrence    1    2    3    4    5    6
Frequency    43   49   56   45   66   41

Are the data consistent at the .05 level with the hypothesis that the die is true?

42 Of 64 offspring of a certain cross between guinea pigs, 34 were red, 10 were black,
and 20 were white. According to the genetic model these numbers should be in
the ratio 9/3/4. Are the data consistent with the model at the .05 level?

43 A prominent baseball player's batting average dropped from .313 in one year to
.280 in the following year. He was at bat 374 times during the first year and 268
times the second. Is the hypothesis tenable at the .05 level that his hitting
ability was the same during the two years?

44 Using the data of Prob. 43, assume that one has a sample of 374 from one Bernoulli
population and 268 from another. Derive the generalized likelihood-ratio test
for testing whether the probability of a hit is the same for the two populations.
How does this test compare with the ordinary test for a 2 x 2 contingency table?

45 The progeny of a certain mating were classified by a physical attribute into three
groups, the numbers being 10, 53, and 46. According to a genetic model the
frequencies should be in the ratios $p^2 / 2p(1 - p) / (1 - p)^2$. Are the data consistent
with the model at the .05 level?

46 A thousand individuals were classified according to sex and according to whether
or not they were color-blind as follows:

               Male    Female
Normal          442      514
Color-blind      38        6

According to the genetic model these numbers should have relative frequencies 
given by 



<7 

2 


q JL 

2 


where/? — 1 — is the proportion of color-blind individuals in the population. 
Are the data consistent with the model?

47 Treating the table of Prob. 46 as a 2 x 2 contingency table, test the hypothesis that
color blindness is independent of sex.

48 Gilby classified 1725 school children according to intelligence and apparent family
economic level. A condensed classification follows:

                     Dull    Intelligent    Very capable
Very well clothed      81        322            233
Well clothed          141        457            153
Poorly clothed        127        163             48

Test for independence at the .01 level.

49 A serum supposed to have some effect in preventing colds was tested on 500 individuals,
and their records for 1 year were compared with the records of 500 untreated
individuals as follows:

             No colds    One cold    More than one cold
Treated         252         145             103
Untreated       224         136             140

Test at the .05 level whether the two trinomial populations may be regarded as the
same.




50 According to the genetic model the proportion of individuals having the four
blood types should be given by

O     $q^2$
A     $p^2 + 2pq$
B     $r^2 + 2qr$
AB    $2pr$

where $p + q + r = 1$. Given the sample O, 374; A, 436; B, 132; AB, 58, how
would you test the correctness of the model?

51 Galton investigated 78 families, classifying children according to whether or not
they were light-eyed, whether or not they had a light-eyed parent, and whether or
not they had a light-eyed grandparent. The following 2 x 2 x 2 table resulted:

                        Grandparent
                 Light             Not
                 Parent            Parent
Child        Light    Not      Light    Not
Light         1928    532       396     503
Not            303    395       225     501

Test for complete independence at the .01 level. Test whether the child classification
is independent of the other two classifications at the .01 level.

52 Compute the exact distribution of $\lambda$ for a 2 x 2 contingency table with marginal
totals $N_{1\cdot} = 4$, $N_{2\cdot} = 7$, $N_{\cdot 1} = 6$, $N_{\cdot 2} = 5$. What is the exact probability that
$-2 \log_e \lambda$ exceeds 3.84, the .05 level of a chi-square distribution for one degree of
freedom?

53 In testing independence in a 2 x 2 contingency table, find the exact distribution of
the generalized likelihood ratio for a sample of size 2. Do the same for samples
of size 3 and 4. Discuss.

54 Let $X_1, \ldots, X_n$ be a random sample from $N(\mu, \sigma^2)$, where $\sigma^2$ is known. Let $\lambda$
denote the generalized likelihood ratio for testing $\mathscr{H}_0\colon \mu = \mu_0$ versus $\mathscr{H}_1\colon \mu \ne \mu_0$.
Find the exact distribution of $-2 \log_e \lambda$, and compare it with the corresponding
asymptotic distribution when $\mathscr{H}_0$ is true. Hint: $\sum (X_i - \bar{X})^2 =
\sum (X_i - \mu)^2 - n(\bar{X} - \mu)^2$.

55 Here is an actual sequence of outcomes for independent Bernoulli trials. Do you
think $p$ (the probability of success) equals $\tfrac12$?


* f/fs, X fs ff, s ffff ; //s/s, S ////, 
tffsf, ffff*, ffs ff, s ffff, ssfff

If you do not think $p$ is $\tfrac12$, what do you think $p$ is? Give a confidence-interval
estimate of $p$. If the above data were generated by tossing two dice, then what
would you think $p$ is? If the data were generated by tossing two coins, then what
would you think $p$ is? (If the data were generated by tossing two dice, assume that
the possible values of $p$ are $j/36$, $j = 0, \ldots, 36$. If the data were generated by
tossing two coins, assume that the possible values of $p$ are $j/4$, $j = 0, \ldots, 4$.)

56 In sampling from a Bernoulli distribution, test the null hypothesis that $p = \tfrac14$
against the alternative that $p = \tfrac13$. Let $p$ refer to the probability of two heads
when tossing two coins, and carry through the test by tossing two coins, using
$\alpha = \beta = .10$. (The alternative was obtained by reasoning that tossing two coins
can result in the three outcomes: two heads, two tails, or one head and one tail,
and then assuming each of the three outcomes equally likely.)

57 Show that the SPRT (sequential probability ratio test) of $\mu = \mu_0$ versus $\mu = \mu_1$ for
the mean of the normal distribution with known variance may be performed by
plotting the two lines


7 3 . . , 

■ log, ko 4 — n 




and 


a , , , /*o + J*I 

>' = log, ki 4 — n 

Pi—po 2 


in the $ny$ plane and then plotting $\sum_{i=1}^{n} X_i$ against $n$ as the observations are made.
The test ends when one of the lines is crossed.

58 Consider sampling from $f(x; \theta) = (1/\theta) I_{(0,\theta)}(x)$, $\theta > 0$. Discuss the sequential
probability ratio test of $\theta = \theta_0$ versus $\theta = \theta_1$ with $\theta_0 < \theta_1$.

59 Let $X_1, X_2, \ldots, X_n, \ldots$ be independent random variables all having the same
Bernoulli distribution given by $P[X_n = 1] = \theta = 1 - P[X_n = 0]$. To test
$\mathscr{H}_0$ versus $\mathscr{H}_1$ (two simple hypotheses on $\theta$; the hypothesized values are
illegible in this copy), the following sequential test was used: Continue sampling as
long as $n/2 - 2 < \sum_{i=1}^{n} X_i < n/2 + 2$, and if and when $\sum X_i$ is first less than or equal to
$n/2 - 2$, accept $\mathscr{H}_0$; and if and when $\sum X_i$ is first greater than or equal to $n/2 + 2$,
accept $\mathscr{H}_1$. Is this test a SPRT?

60 Assume that X has a Poisson distribution with mean $\theta$. Consider testing
$\mathscr{H}_0\colon \theta = 1$ versus $\mathscr{H}_1\colon \theta = 2$. Fix $\alpha = \beta = .05$.

(a) Find the fixed sample size necessary to achieve the prescribed error sizes.

(b) Derive the (approximate) sequential probability ratio test, and show that it
can be based on the statistics $\sum_{i=1}^{n} X_i$, $n = 1, 2, \ldots$.

(c) Find the approximate expected sample sizes for the sequential probability
ratio test.



X

LINEAR MODELS


1 INTRODUCTION AND SUMMARY

The purpose of this chapter is to discuss a special case of the linear statistical
model. There is a large amount of material available on this subject, but we
will discuss only the special case of the simple linear model. Some authors
refer to this as the theory of straight-line regression. To study this model will
require the use of some of the theory in previous chapters, such as distribution
theory, point and interval estimation concepts, and some material from
hypothesis testing. This chapter will demonstrate how these concepts can be
utilized for a situation, the simple linear model, that is important in applied
statistics.

In Sec. 2 two examples are given to illustrate how the simple linear model
can be used to simulate real-world problems. In Sec. 3 the simple linear model
will be rigorously defined and put into a framework that will allow us to study it
by using statistical procedures from previous chapters. In the remaining
sections discussion will be centered around point estimation, interval estimation,
and testing hypotheses on the parameters in the model under two different
assumptions about the distributions of the random variables in the model.


2 EXAMPLES OF THE LINEAR MODEL

In this section we will give two examples to illustrate how the linear model arises 
in applied problems. 


EXAMPLE 1 The distance s that a particle travels in time t is given by the
formula $s = \beta_0 + \beta_1 t$, where $\beta_1$ is the average speed and $\beta_0$ is the
position at time t = 0. If $\beta_0$ and $\beta_1$ are unknown, then s can be observed
for two distinct values of t and the resulting two equations solved for $\beta_0$
and $\beta_1$. For example, suppose that s is observed to be 2 when t = 1, and
s is 11 when t = 4. This gives $2 = \beta_0 + \beta_1$ and $11 = \beta_0 + 4\beta_1$, and the
solution is $\beta_0 = -1$, $\beta_1 = 3$; so $s = -1 + 3t$. Suppose that for some
reason the distance cannot be observed accurately, but there is a measurement
error which is of a random nature. Therefore s cannot be observed,
but suppose that we can observe Y, where $Y = s + E$ and E is a random
error whose mean is 0. Substituting for s gives us

\[ Y = \beta_0 + \beta_1 t + E, \tag{1} \]

where Y is an observable random variable, t is an observable nonrandom
variable, E is an unobservable random variable, and $\beta_0$ and $\beta_1$ are unknown
parameters. We cannot solve for $\beta_0$ and $\beta_1$ by observing two sets of
values of Y and t, as we did with s and t above, since there is no functional
relationship between Y and t. The objective in this model is to find $\beta_0$
and $\beta_1$ and hence evaluate $s = \beta_0 + \beta_1 t$ for various values of t. Since s
is subject to errors and cannot be observed, we cannot know $\beta_0$ and $\beta_1$,
but by observing various sets of Y and t values statistical methods can be
used to obtain estimates of $\beta_0$, $\beta_1$, and s. This type of model is a
functional-relationship model with a measurement error. ////
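
As a small numerical illustration of Example 1, the following sketch first solves the two exact equations and then repeats the calculation with two readings that carry error. The function name `line_through` and the error values +0.5 and -0.5 are hypothetical choices made only for this illustration; they are not part of the example.

```python
def line_through(t1, s1, t2, s2):
    """Return (beta0, beta1) of the line s = beta0 + beta1*t through two points."""
    if t1 == t2:
        raise ValueError("the two values of t must be distinct")
    beta1 = (s2 - s1) / (t2 - t1)
    beta0 = s1 - beta1 * t1
    return beta0, beta1

# Exact observations from Example 1: s = 2 at t = 1 and s = 11 at t = 4.
beta0, beta1 = line_through(1, 2, 4, 11)

# Hypothetical readings Y = s + E with illustrative errors +0.5 and -0.5.
noisy0, noisy1 = line_through(1, 2 + 0.5, 4, 11 - 0.5)
```

With exact distances the two equations return the true parameters; with the perturbed readings they return a different line, which is why two observations no longer suffice once an error term is present.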


EXAMPLE 2 For another example, consider the relationship between the
height h and weight w of individuals in a certain city. Certainly there is no
functional relationship between w and h, but there does seem to be some
kind of relation. We shall consider them as random variables and shall
postulate that (W, H) has a bivariate normal distribution. Then the
expected value of H for a given value w of W is given by

\[ \mathscr{E}[H \mid W = w] = \beta_0 + \beta_1 w, \tag{2} \]

where $\beta_0$ and $\beta_1$ are functions of the parameters in a bivariate normal
density. Although there is no functional relationship between H and W, if
they are assumed jointly normal, there is a linear functional relationship
between the weights and the average value of the heights. Thus we can
write the following: H and W are jointly normal, and

\[ \mathscr{E}[H \mid W = w] = \beta_0 + \beta_1 w, \]

or we can write

\[ H_w = \beta_0 + \beta_1 w + E, \]

where E is a normally distributed random variable denoting error. This
is a regression model, and although it came from a somewhat different
problem than the functional relationship in Example 1, they both are
special cases of a linear statistical model, which will be discussed in this
chapter. ////


3 DEFINITION OF LINEAR MODEL 

Let $\mu(\cdot)$ be a linear function of a real variable x. This is defined by
$\mu(x) = \beta_0 + \beta_1 x$, where x is in a domain D. Quite often D will be the entire real line,
a half line, or a bounded interval on the real line. To model the situations
referred to in Examples 1 and 2 above, we assume that there exists a family of
c.d.f.'s (one c.d.f. for each x in D) such that the mean of the c.d.f. corresponding
to a given x (say $x_0$) in D is $\beta_0 + \beta_1 x_0$. Thus the means of the c.d.f.'s are on the
line defined by $\mu(x) = \beta_0 + \beta_1 x$. See Fig. 1. The objective is to sample some
of the c.d.f.'s and on the basis of the sample to make statistical inferences about
$\beta_0$, $\beta_1$, etc.

The sampling is accomplished as follows:

(i) A set of n x's in D is observed and denoted by $x_1, x_2, \ldots, x_n$.
The $x_i$'s are not random variables, but they may be selected either by
some random procedure or by purposeful selection.

(ii) Each $x_i$ determines a c.d.f. whose mean is $\beta_0 + \beta_1 x_i$ and
whose variance is $\sigma^2$. From this c.d.f. a value is selected at random and
denoted by $Y_i$. ($Y_i$ is a shortened notation for $Y_{x_i}$.)

Thus we have a set of n pairs of observations, which we denote by $(Y_1, x_1)$,
$(Y_2, x_2), \ldots, (Y_n, x_n)$. We have assumed that

\[ \mathscr{E}[Y_i] = \beta_0 + \beta_1 x_i \]

and

\[ \operatorname{var}[Y_i] = \sigma^2. \]

Therefore we can define random variables $E_1, E_2, \ldots, E_n$ by

\[ E_i = Y_i - \beta_0 - \beta_1 x_i \quad \text{for } i = 1, 2, \ldots, n, \]

and the $E_i$ satisfy

\[ \mathscr{E}[E_i] = 0 \]

and

\[ \operatorname{var}[E_i] = \sigma^2. \]

So we can write

\[ Y_i = \beta_0 + \beta_1 x_i + E_i \quad \text{for } i = 1, 2, \ldots, n, \]

where

\[ \mathscr{E}[E_i] = 0 \quad \text{and} \quad \operatorname{var}[E_i] = \sigma^2, \]

and this defines a linear model. We summarize these ideas below.


Definition 1 Linear model Let the function $\mu(\cdot)$ be defined by
$\mu(x) = \beta_0 + \beta_1 x$ for all x in a set D. For each x in D let $F_{Y_x}(\cdot)$ be a
c.d.f. with a mean equal to $\mu(x)$, that is, $\beta_0 + \beta_1 x$, and variance $\sigma^2$. Let
$x_1, x_2, \ldots, x_n$ be an observed set of n x's from D. For each $x_i$ let $Y_i$ be a random
sample of size 1 from the c.d.f. $F_{Y_{x_i}}(\cdot)$ for $i = 1, 2, \ldots, n$. Then $(Y_1, x_1)$,
$(Y_2, x_2), \ldots, (Y_n, x_n)$ is a set of n observations related by

\[
\begin{aligned}
\mathscr{E}[Y_i] &= \beta_0 + \beta_1 x_i \\
\operatorname{var}[Y_i] &= \sigma^2, \qquad i = 1, 2, \ldots, n.
\end{aligned}
\tag{3}
\]

These specifications define a linear statistical model. ////

Note We can write Eq. (3) as

\[
\begin{aligned}
Y_i &= \beta_0 + \beta_1 x_i + E_i \\
\mathscr{E}[E_i] &= 0 \\
\operatorname{var}[E_i] &= \sigma^2,
\end{aligned}
\tag{4}
\]

where $i = 1, 2, \ldots, n$. ////
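
The sampling scheme behind Eq. (4) can be imitated on a computer. The sketch below is only an illustration: the function name `simulate_linear_model`, the design points, and the parameter values are hypothetical, and the errors are drawn from a normal distribution purely for convenience (Definition 1 itself requires only mean 0 and variance $\sigma^2$).

```python
import random

def simulate_linear_model(xs, beta0, beta1, sigma, rng):
    """Draw one Y_i = beta0 + beta1*x_i + E_i for each fixed x_i.

    The errors E_i are taken to be normal with mean 0 and standard
    deviation sigma; this is an assumption of the sketch only.
    """
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

rng = random.Random(0)             # fixed seed so the run is reproducible
xs = [0.0, 1.0, 2.0, 3.0, 4.0]     # nonrandom design points chosen in D
ys = simulate_linear_model(xs, beta0=-1.0, beta1=3.0, sigma=0.5, rng=rng)

# The unobservable errors can be recovered here only because the
# parameters are known to the simulation.
errors = [y - (-1.0 + 3.0 * x) for x, y in zip(xs, ys)]
```

In practice only the pairs $(Y_i, x_i)$ are available; the list `errors` is shown to emphasize that the $E_i$ are defined through the unknown parameters.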


Note The word "linear" in "linear statistical model" refers to the fact
that the function $\mu(\cdot)$ is linear in the unknown parameters. In the
simple example we have referred to, $\mu(\cdot)$ is defined by $\mu(x) = \beta_0 + \beta_1 x$, x
in D, and this is linear in x, but this is not an essential part of the definition
of this linear model. For example, $Y = \mu(x) + E$, where $\mu(x) = \beta_0 + \beta_1 e^x$,
is a linear statistical model. ////
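
The point of this note can be checked numerically: because $\mu(x) = \beta_0 + \beta_1 e^x$ is linear in $\beta_0$ and $\beta_1$, the straight-line formulas apply once $e^x$ is treated as the regressor. The sketch below is illustrative only; the helper `least_squares`, the data, and the parameter values 2 and 3 are assumptions made for the example, and the fitting formulas used are the ones derived in Sec. 4.

```python
import math

def least_squares(zs, ys):
    """Intercept and slope for the model y = b0 + b1*z (straight-line formulas)."""
    n = len(zs)
    zbar = sum(zs) / n
    ybar = sum(ys) / n
    szz = sum((z - zbar) ** 2 for z in zs)
    szy = sum((z - zbar) * (y - ybar) for z, y in zip(zs, ys))
    b1 = szy / szz
    return ybar - b1 * zbar, b1

xs = [0.0, 0.5, 1.0, 1.5]
ys = [2.0 + 3.0 * math.exp(x) for x in xs]   # error-free, illustrative values
zs = [math.exp(x) for x in xs]               # transformed regressor z = e^x
b0, b1 = least_squares(zs, ys)
```

Since the illustrative data contain no error, the fit on the transformed regressor returns the parameters used to generate them, up to rounding.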

Note In many situations some additional assumptions on the c.d.f.
$F_{Y_x}(\cdot)$ will be made, such as normality. Also, generally the sampling
procedure will be such that the $Y_i$ will be either jointly independent or
pairwise uncorrelated. In fact we shall discuss inference procedures for
two sets of assumptions on the random variables, defined in Cases A and B
below. ////

Case A For this case we assume that the n random variables are jointly
independent and each $Y_i$ is a normal random variable. ////

Case B For this case we assume only that the $Y_i$ are pairwise uncorrelated;
that is, $\operatorname{cov}[Y_i, Y_j] = 0$ for all $i \neq j = 1, 2, \ldots, n$. ////

For Case A we shall discuss the following:

(i) Point estimation of $\beta_0$, $\beta_1$, $\sigma^2$, and $\mu(x)$ for any x in D
(ii) Confidence intervals for $\beta_0$, $\beta_1$, $\sigma^2$, and $\mu(x)$ for any x in D
(iii) Tests of hypotheses on $\beta_0$, $\beta_1$, and $\sigma^2$

For Case B we shall discuss the following:

(iv) Point estimation of $\beta_0$, $\beta_1$, $\sigma^2$, and $\mu(x)$ for any x in D


4 POINT ESTIMATION— CASE A

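For this case $Y_1, Y_2, \ldots, Y_n$ are independent normal random variables with
means $\beta_0 + \beta_1 x_1, \beta_0 + \beta_1 x_2, \ldots, \beta_0 + \beta_1 x_n$ and variances $\sigma^2$. To find point
estimators, we shall use the method of maximum likelihood. The likelihood
function is

\[
L(\beta_0, \beta_1, \sigma^2) = L(\beta_0, \beta_1, \sigma^2; y_1, y_2, \ldots, y_n)
= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2\right]
\]

and

\[
\log L(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2.
\tag{5}
\]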


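The partial derivatives of $\log L(\beta_0, \beta_1, \sigma^2)$ with respect to $\beta_0$, $\beta_1$, and $\sigma^2$ are
obtained and set equal to 0. We let $\hat\beta_0$, $\hat\beta_1$, $\tilde\sigma^2$ denote the solutions of the
resulting three equations. The three equations are given below (with some
minor simplifications):

\[
\begin{aligned}
\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) &= 0 \\
\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)\,x_i &= 0 \\
\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2 &= n\tilde\sigma^2.
\end{aligned}
\tag{6}
\]

The first two equations are called the normal equations for determining $\beta_0$ and
$\beta_1$. They are linear in $\hat\beta_0$ and $\hat\beta_1$ and are readily solved. We obtain

\[ \hat\beta_1 = \frac{\sum (y_i - \bar{y})(x_i - \bar{x})}{\sum (x_i - \bar{x})^2} \tag{7} \]

\[ \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \tag{8} \]

\[ \tilde\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2. \tag{9} \]

These are maximum-likelihood estimates of $\beta_1$, $\beta_0$, and $\sigma^2$, respectively. We
notice that the $x_i$'s must be such that $\sum (x_i - \bar{x})^2 \neq 0$; that is, there must be at
least two distinct values for the $x_i$.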
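
Equations (7) to (9) translate directly into a few lines of code. The sketch below is an illustration under stated assumptions: the function name `mle_simple_linear` and the four data points are hypothetical and are not taken from the text.

```python
def mle_simple_linear(xs, ys):
    """Return (beta0_hat, beta1_hat, sigma2_mle) following Eqs. (7) to (9)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    beta1 = sxy / sxx                                   # Eq. (7)
    beta0 = ybar - beta1 * xbar                         # Eq. (8)
    sse = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))
    return beta0, beta1, sse / n                        # Eq. (9): divisor n

# Illustrative data (hypothetical values).
b0, b1, s2 = mle_simple_linear([0, 1, 2, 3], [1, 3, 2, 4])
```

For these illustrative values the sums are $\sum (x_i - \bar{x})^2 = 5$ and $\sum (y_i - \bar{y})(x_i - \bar{x}) = 4$, so the slope estimate is 4/5 and the intercept estimate is $2.5 - 0.8(1.5) = 1.3$. The guard on `sxx` mirrors the requirement of at least two distinct $x_i$.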



488 LINEAR MODELS 


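Note that since

\[
\begin{aligned}
f_{Y_i}(y_i; \beta_0, \beta_1, \sigma^2)
&= (2\pi\sigma^2)^{-1/2} \exp\left[-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right] \\
&= (2\pi\sigma^2)^{-1/2} \exp\left[-\frac{1}{2\sigma^2}(\beta_0 + \beta_1 x_i)^2\right]
\times \exp\left[-\frac{1}{2\sigma^2}\,y_i^2 + \frac{\beta_0}{\sigma^2}\,y_i + \frac{\beta_1}{\sigma^2}\,x_i y_i\right],
\end{aligned}
\]

the joint density of $Y_1, \ldots, Y_n$ is a member of a three-parameter exponential family; hence,
by a generalization of Theorem 16 of Chap. VII,

\[ \sum Y_i^2, \quad \sum Y_i, \quad \sum x_i Y_i \tag{10} \]

is a set of minimal sufficient and jointly complete statistics. Furthermore,
since the set of statistics given in Eq. (10) is a one-to-one transformation of the
estimators (statistics) defined by Eqs. (7) to (9), the estimators are themselves
minimal sufficient and jointly complete.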

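To further examine the properties that the estimators possess, we shall
find the joint distribution of statistics corresponding to $\hat\beta_0$, $\hat\beta_1$, $\tilde\sigma^2$. To do this,
we shall first find the moment generating function of $\Theta_1$, $\Theta_2$, and $\Theta_3$, which
are random quantities with values defined by

\[
\theta_1 = \frac{\hat\beta_0 - \beta_0}{\sigma}, \qquad
\theta_2 = \frac{\hat\beta_1 - \beta_1}{\sigma}, \qquad
\theta_3 = \frac{n\tilde\sigma^2}{\sigma^2}.
\tag{11}
\]

By Definition 25 of Chap. IV the joint moment generating function of $\Theta_1$, $\Theta_2$,
$\Theta_3$ is defined to be

\[ m(t_1, t_2, t_3) = \mathscr{E}\left[\exp(t_1\Theta_1 + t_2\Theta_2 + t_3\Theta_3)\right] \]

if the expectation exists for $-h < t_i < h$ for some $h > 0$. We obtain

\[
m(t_1, t_2, t_3) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty}
\exp(t_1\theta_1 + t_2\theta_2 + t_3\theta_3)\,
\frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2} \sum (y_i - \beta_0 - \beta_1 x_i)^2\right]
dy_1 \cdots dy_n,
\]

where in the integral the quantities $\theta_1$, $\theta_2$, $\theta_3$ will be written in terms of $y_i$ and $x_i$.

This integral is straightforward but tedious to evaluate, and the result is

\[
m(t_1, t_2, t_3) = \exp\left\{\frac{1}{2}\left[
t_1^2\,\frac{\sum x_i^2}{n\sum (x_i - \bar{x})^2}
- 2 t_1 t_2\,\frac{\bar{x}}{\sum (x_i - \bar{x})^2}
+ t_2^2\,\frac{1}{\sum (x_i - \bar{x})^2}\right]\right\}
\times (1 - 2t_3)^{-(n-2)/2}
\quad \text{for } t_3 < \tfrac{1}{2}.
\]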




From this moment generating function we can learn a number of things:

(i) It factors into a function of $t_1$ and $t_2$ only times a function of $t_3$
only. We write this result as $m(t_1, t_2, t_3) = m_1(t_1, t_2)\,m_2(t_3)$. By using
Theorem 10 of Chap. IV we know that the random variables associated
with $t_1$ and $t_2$ are independent of the random variable associated with
$t_3$; that is, $\Theta_1$ and $\Theta_2$ are independent of $\Theta_3$, which implies that the
maximum-likelihood estimators of $\beta_0$ and $\beta_1$ are jointly independent of the
maximum-likelihood estimator of $\sigma^2$.

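(ii) Since by a generalization of Theorem 7 in Chap. II a moment
generating function uniquely determines the distribution of the random
variables involved, we shall try to recognize the form of $m_1(t_1, t_2)$ and the
form of $m_2(t_3)$. We note by Theorem 12 of Chap. IV that $m_1(t_1, t_2)$ is the
moment generating function of a bivariate normal distribution, and, of
course, we obtain the means, variances, and covariance. We see that
the random variables, say $\hat{B}_0$ and $\hat{B}_1$, associated with $\hat\beta_0$ and $\hat\beta_1$ are bivariate
normal random variables with means $(\beta_0, \beta_1)$ and covariance matrix

\[
\begin{bmatrix}
\dfrac{\sigma^2 \sum x_i^2}{n \sum (x_i - \bar{x})^2} & \dfrac{-\sigma^2 \bar{x}}{\sum (x_i - \bar{x})^2} \\[2ex]
\dfrac{-\sigma^2 \bar{x}}{\sum (x_i - \bar{x})^2} & \dfrac{\sigma^2}{\sum (x_i - \bar{x})^2}
\end{bmatrix}.
\tag{12}
\]

Another way to state this is the following: $(\hat{B}_0, \hat{B}_1)$ is a bivariate normal
random variable with parameters

\[
\mathscr{E}[\hat{B}_0] = \beta_0, \qquad \mathscr{E}[\hat{B}_1] = \beta_1,
\]

and

\[
\operatorname{var}[\hat{B}_0] = \frac{\sigma^2 \sum x_i^2}{n \sum (x_i - \bar{x})^2}, \qquad
\operatorname{var}[\hat{B}_1] = \frac{\sigma^2}{\sum (x_i - \bar{x})^2}, \qquad
\operatorname{cov}[\hat{B}_0, \hat{B}_1] = \frac{-\sigma^2 \bar{x}}{\sum (x_i - \bar{x})^2}.
\]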

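(iii) We recognize that $m_2(t_3)$ is the moment generating function
of a chi-square random variable with n - 2 degrees of freedom. Hence
we have

\[ \Theta_3 = \frac{n\tilde\sigma^2}{\sigma^2}, \]

which is distributed as a chi-square distribution with n - 2 degrees of
freedom. (Here, and in the rest of this chapter, $\hat\sigma^2$ and $\tilde\sigma^2$ are also used to
denote the random variables whose values are the estimates $\hat\sigma^2$ and $\tilde\sigma^2$,
respectively.) By Eq. (22) of Chap. VI we get

\[ \mathscr{E}\left[\frac{n\tilde\sigma^2}{\sigma^2}\right] = n - 2, \]

so we define $\hat\sigma^2$ by

\[ \hat\sigma^2 = \frac{n}{n-2}\,\tilde\sigma^2 = \frac{1}{n-2} \sum (Y_i - \hat{B}_0 - \hat{B}_1 x_i)^2. \]

We shall summarize these results in the following theorem.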


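Theorem 1 Consider Case A of the simple linear model given in
Definition 1. The maximum-likelihood estimators of $\beta_1$, $\beta_0$, and $\sigma^2$
(corrected for bias) are given by

\[
\begin{aligned}
\hat{B}_1 &= \frac{\sum (Y_i - \bar{Y})(x_i - \bar{x})}{\sum (x_i - \bar{x})^2} \\
\hat{B}_0 &= \bar{Y} - \hat{B}_1 \bar{x} \\
\hat\sigma^2 &= \frac{1}{n-2} \sum (Y_i - \hat{B}_0 - \hat{B}_1 x_i)^2.
\end{aligned}
\tag{13}
\]

These estimators satisfy the following:

(i) They are jointly complete sufficient statistics.

(ii) They are unbiased estimators of their respective parameters.

(iii) $(\hat{B}_0, \hat{B}_1)$ is independent of $\hat\sigma^2$.

(iv) $(\hat{B}_0, \hat{B}_1)$ has a bivariate normal distribution with mean $(\beta_0, \beta_1)$
and covariance matrix given by Eq. (12).

(v) $(n - 2)\hat\sigma^2/\sigma^2$ is a chi-square random variable with n - 2
degrees of freedom. ////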
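
The entries of the covariance matrix in Eq. (12) and the bias correction of Theorem 1 are simple to evaluate for a given design. The sketch below is only an illustration; the function names and the design points are hypothetical.

```python
def estimator_covariance(xs, sigma2):
    """Covariance matrix of (B0_hat, B1_hat) from Eq. (12), as nested lists."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    var_b0 = sigma2 * sum(x * x for x in xs) / (n * sxx)
    var_b1 = sigma2 / sxx
    cov = -sigma2 * xbar / sxx
    return [[var_b0, cov], [cov, var_b1]]

def unbiased_sigma2(sigma2_mle, n):
    """Bias-corrected variance estimate: n/(n-2) times the MLE of Eq. (9)."""
    return n * sigma2_mle / (n - 2)

# Illustrative design centered at 0, so the covariance entry vanishes.
cov_centered = estimator_covariance([-1.0, 0.0, 1.0], sigma2=1.0)
```

For the centered design the off-diagonal entry is 0 because $\bar{x} = 0$, which for jointly normal estimators means the intercept and slope estimators are independent.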


In Chap. VII we noted that maximum-likelihood estimators possess a
number of good properties, but they, in general, are not minimum-variance
unbiased estimators. We now employ a minor generalization of Theorem 17
of Chap. VII, along with the results of Theorem 1 above, to state a strong
optimal property about the estimators $\hat{B}_0$, $\hat{B}_1$, $\hat\sigma^2$ of $\beta_0$, $\beta_1$, $\sigma^2$.

Theorem 2 Consider the simple linear model given in Definition 1. Let
$\tau(\beta_0, \beta_1, \sigma^2)$ be any known function of the parameters $\beta_0$, $\beta_1$, and $\sigma^2$
for which an unbiased estimator exists. Then there exists an unbiased
estimator of $\tau(\beta_0, \beta_1, \sigma^2)$ that is a function of $\hat{B}_0$, $\hat{B}_1$, and $\hat\sigma^2$. We denote
this estimator by $t(\hat{B}_0, \hat{B}_1, \hat\sigma^2)$, and it is the UMVUE of $\tau(\beta_0, \beta_1, \sigma^2)$.

Proof This result follows from a generalization of Theorem 17
of Chap. VII, since $\hat{B}_0$, $\hat{B}_1$, $\hat\sigma^2$ is a set of sufficient complete statistics. ////

Corollary The UMVUE of each of the parameters $\beta_0$, $\beta_1$, and $\sigma^2$ is
given by $\hat{B}_0$, $\hat{B}_1$, and $\hat\sigma^2$, respectively, in Theorem 1. ////

Corollary The UMVUE of $\mu(x) = \beta_0 + \beta_1 x$ for any x in the domain
D is $\hat{M}(x)$, where $\hat{M}(x) = \hat{B}_0 + \hat{B}_1 x$. ($\hat{M}(x)$ is the random variable with
values $\hat\mu(x) = \hat\beta_0 + \hat\beta_1 x$.) ////

Corollary For any two known constants $c_1$ and $c_2$ the UMVUE of
$c_1\beta_0 + c_2\beta_1$ is $c_1\hat{B}_0 + c_2\hat{B}_1$. ////


5 CONFIDENCE INTERVALS— CASE A 


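To obtain a $\gamma$-level confidence interval on $\sigma^2$, we note by Theorem 1 that

\[ U = \frac{(n - 2)\hat\sigma^2}{\sigma^2} \]

is distributed as a chi-square random variable with n - 2 d.f. (degrees of
freedom). Hence U is a pivotal quantity, and we get

\[ P\left[\chi^2_{(1-\gamma)/2}(n - 2) \le U \le \chi^2_{(1+\gamma)/2}(n - 2)\right] = \gamma, \]

where $\chi^2_p(n - 2)$ denotes the pth quantile of the chi-square distribution with n - 2 d.f.
If we substitute for U and simplify, we get

\[
P\left[\frac{(n - 2)\hat\sigma^2}{\chi^2_{(1+\gamma)/2}(n - 2)} \le \sigma^2 \le \frac{(n - 2)\hat\sigma^2}{\chi^2_{(1-\gamma)/2}(n - 2)}\right] = \gamma,
\tag{14}
\]

and this is a $100\gamma$ percent confidence interval on $\sigma^2$.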
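
Interval (14) needs only the bias-corrected variance estimate and two chi-square quantiles. In the sketch below the quantiles are passed in by the caller (for example, read from a chi-square table), so no statistical library is assumed; the function name and the numerical inputs are hypothetical.

```python
def sigma2_interval(sigma2_hat, n, chi2_lower, chi2_upper):
    """Confidence interval (14) for sigma^2.

    chi2_lower and chi2_upper are the (1-gamma)/2 and (1+gamma)/2 quantiles
    of the chi-square distribution with n-2 d.f., supplied by the caller.
    """
    if not 0 < chi2_lower < chi2_upper:
        raise ValueError("quantiles must satisfy 0 < lower < upper")
    scaled = (n - 2) * sigma2_hat
    return scaled / chi2_upper, scaled / chi2_lower

# Illustrative numbers: n = 12, sigma2_hat = 2.0, gamma = 0.90, so the
# quantiles are for 10 d.f.: about 3.940 (5%) and 18.307 (95%).
low, high = sigma2_interval(2.0, 12, 3.940, 18.307)
```

Note that the larger quantile produces the lower endpoint, because U is inverted when the inequality is solved for $\sigma^2$.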

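To obtain a $\gamma$-level confidence interval on $\beta_0$, we note that by Theorem 1:

(i) $Z = (\hat{B}_0 - \beta_0)\sqrt{n\sum (x_i - \bar{x})^2 / (\sigma^2 \sum x_i^2)}$ is distributed as a
standard normal random variable.

(ii) $U = (n - 2)\hat\sigma^2/\sigma^2$ is distributed as a chi-square random variable
with n - 2 d.f.

(iii) Z and U are independent.

Hence, by Theorem 10 of Chap. VI,

\[
T = \frac{Z}{\sqrt{U/(n - 2)}} = (\hat{B}_0 - \beta_0)\sqrt{\frac{n\sum (x_i - \bar{x})^2}{\hat\sigma^2 \sum x_i^2}}
\]

is distributed as Student's t distribution with n - 2 d.f. Hence T is a pivotal
quantity. We get

\[ P\left[-t_{(1+\gamma)/2}(n - 2) \le T \le t_{(1+\gamma)/2}(n - 2)\right] = \gamma, \]

and if we substitute for T, we get

\[
P\left[-t_{(1+\gamma)/2}(n - 2) \le (\hat{B}_0 - \beta_0)\sqrt{\frac{n\sum (x_i - \bar{x})^2}{\hat\sigma^2 \sum x_i^2}} \le t_{(1+\gamma)/2}(n - 2)\right] = \gamma.
\]

After simplifying we get the following for a $100\gamma$ percent confidence interval
on $\beta_0$:

\[
P\left[\hat{B}_0 - t_{(1+\gamma)/2}(n - 2)\sqrt{\frac{\hat\sigma^2 \sum x_i^2}{n\sum (x_i - \bar{x})^2}}
\le \beta_0 \le
\hat{B}_0 + t_{(1+\gamma)/2}(n - 2)\sqrt{\frac{\hat\sigma^2 \sum x_i^2}{n\sum (x_i - \bar{x})^2}}\right] = \gamma.
\]

We note that by Theorem 1 we get

\[ \operatorname{var}[\hat{B}_0] = \frac{\sigma^2 \sum x_i^2}{n\sum (x_i - \bar{x})^2}, \]

and the estimated variance of $\hat{B}_0$, which we write as $\widehat{\operatorname{var}}[\hat{B}_0]$, is given by

\[ \widehat{\operatorname{var}}[\hat{B}_0] = \frac{\hat\sigma^2 \sum x_i^2}{n\sum (x_i - \bar{x})^2}. \]

Then the confidence statement can be written as

\[
P\left[\hat{B}_0 - t_{(1+\gamma)/2}(n - 2)\sqrt{\widehat{\operatorname{var}}[\hat{B}_0]}
\le \beta_0 \le
\hat{B}_0 + t_{(1+\gamma)/2}(n - 2)\sqrt{\widehat{\operatorname{var}}[\hat{B}_0]}\right] = \gamma.
\tag{15}
\]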

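To obtain a $\gamma$-level confidence interval on $\beta_1$, we note that by Theorem 1:

(i) $Z = (\hat{B}_1 - \beta_1)\sqrt{\sum (x_i - \bar{x})^2/\sigma^2}$ is distributed as a standard
normal random variable.

(ii) $U = (n - 2)\hat\sigma^2/\sigma^2$ is distributed as a chi-square random variable
with n - 2 d.f.

(iii) Z and U are independent.

Hence, by Theorem 10 of Chap. VI,

\[ T = (\hat{B}_1 - \beta_1)\sqrt{\frac{\sum (x_i - \bar{x})^2}{\hat\sigma^2}} \]

is distributed as Student's t distribution with n - 2 d.f. Hence T is a pivotal
quantity, and we get

\[ P\left[-t_{(1+\gamma)/2}(n - 2) \le T \le t_{(1+\gamma)/2}(n - 2)\right] = \gamma. \]

If we substitute for T, we obtain

\[
P\left[-t_{(1+\gamma)/2}(n - 2) \le (\hat{B}_1 - \beta_1)\sqrt{\frac{\sum (x_i - \bar{x})^2}{\hat\sigma^2}} \le t_{(1+\gamma)/2}(n - 2)\right] = \gamma.
\]

After some simplification we get

\[
P\left[\hat{B}_1 - t_{(1+\gamma)/2}(n - 2)\sqrt{\frac{\hat\sigma^2}{\sum (x_i - \bar{x})^2}}
\le \beta_1 \le
\hat{B}_1 + t_{(1+\gamma)/2}(n - 2)\sqrt{\frac{\hat\sigma^2}{\sum (x_i - \bar{x})^2}}\right] = \gamma,
\]

and this is a $100\gamma$ percent confidence interval on $\beta_1$. We note from Theorem 1
that

\[ \operatorname{var}[\hat{B}_1] = \frac{\sigma^2}{\sum (x_i - \bar{x})^2} \]

and that the estimated variance of $\hat{B}_1$, which is denoted by $\widehat{\operatorname{var}}[\hat{B}_1]$, is given by

\[ \widehat{\operatorname{var}}[\hat{B}_1] = \frac{\hat\sigma^2}{\sum (x_i - \bar{x})^2}. \]

If we substitute this into the confidence-interval statement, we obtain

\[
P\left[\hat{B}_1 - t_{(1+\gamma)/2}(n - 2)\sqrt{\widehat{\operatorname{var}}[\hat{B}_1]}
\le \beta_1 \le
\hat{B}_1 + t_{(1+\gamma)/2}(n - 2)\sqrt{\widehat{\operatorname{var}}[\hat{B}_1]}\right] = \gamma.
\tag{16}
\]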
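
Intervals (15) and (16) can be computed together, since both use the same estimates and the same t quantile. The following is a minimal sketch, not a library routine: the function name and the data are hypothetical, and the quantile $t_{(1+\gamma)/2}(n - 2)$ is supplied by the caller from a t table.

```python
import math

def coefficient_intervals(xs, ys, t_quantile):
    """Intervals (15) and (16); t_quantile is t_{(1+gamma)/2}(n-2) from a table."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    # Bias-corrected variance estimate with divisor n - 2 (Theorem 1).
    sigma2_hat = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    se_b0 = math.sqrt(sigma2_hat * sum(x * x for x in xs) / (n * sxx))
    se_b1 = math.sqrt(sigma2_hat / sxx)
    return ((b0 - t_quantile * se_b0, b0 + t_quantile * se_b0),
            (b1 - t_quantile * se_b1, b1 + t_quantile * se_b1))

# Illustrative data; 4.303 is the tabled 0.975 quantile of t with 2 d.f.
ci_b0, ci_b1 = coefficient_intervals([0, 1, 2, 3], [1, 3, 2, 4], 4.303)
```

With only n = 4 points the t quantile is large, so both intervals are wide; the example is meant to show the arithmetic, not a realistic sample size.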


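To obtain a $\gamma$-level confidence interval on $\mu(x)$ for any x in the domain D,
we note that

(i) $\mu(x) = \beta_0 + \beta_1 x$.

(ii) $\hat{M}(x) = \hat{B}_0 + \hat{B}_1 x$.

(iii) $\mathscr{E}[\hat{M}(x)] = \mu(x)$.

(iv) The variance of $\hat{M}(x)$ is

\[
\begin{aligned}
\operatorname{var}[\hat{M}(x)] &= \operatorname{var}[\hat{B}_0 + \hat{B}_1 x] \\
&= \operatorname{var}[\hat{B}_0] + 2x \operatorname{cov}[\hat{B}_0, \hat{B}_1] + x^2 \operatorname{var}[\hat{B}_1] \\
&= \frac{\sigma^2 \sum x_i^2}{n\sum (x_i - \bar{x})^2} - \frac{2x\bar{x}\sigma^2}{\sum (x_i - \bar{x})^2} + \frac{x^2\sigma^2}{\sum (x_i - \bar{x})^2} \\
&= \sigma^2\left[\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right].
\end{aligned}
\]

(v) $Z = [\hat{M}(x) - \mu(x)]/\sqrt{\operatorname{var}[\hat{M}(x)]}$ is distributed as a standard
normal random variable.

(vi) $U = (n - 2)\hat\sigma^2/\sigma^2$ is distributed as a chi-square random variable
with n - 2 d.f.

(vii) U and Z are independent.

Hence

\[
T = \frac{\hat{M}(x) - \mu(x)}{\sqrt{\widehat{\operatorname{var}}[\hat{M}(x)]}}
= \frac{\hat{B}_0 + \hat{B}_1 x - \beta_0 - \beta_1 x}{\sqrt{\hat\sigma^2\left[1/n + (x - \bar{x})^2/\sum (x_i - \bar{x})^2\right]}}
\]

is distributed as Student's t distribution with n - 2 d.f. Hence T is a
pivotal quantity, and we obtain

\[ P\left[-t_{(1+\gamma)/2}(n - 2) \le T \le t_{(1+\gamma)/2}(n - 2)\right] = \gamma. \]

If we substitute for T and simplify, we get

\[
P\left[\hat{B}_0 + \hat{B}_1 x - t_{(1+\gamma)/2}(n - 2)\sqrt{\hat\sigma^2\left[\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]}
\le \beta_0 + \beta_1 x \le
\hat{B}_0 + \hat{B}_1 x + t_{(1+\gamma)/2}(n - 2)\sqrt{\hat\sigma^2\left[\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}\right]}\right] = \gamma,
\]

or

\[
P\left[\hat{B}_0 + \hat{B}_1 x - t_{(1+\gamma)/2}(n - 2)\sqrt{\widehat{\operatorname{var}}[\hat{M}(x)]}
\le \beta_0 + \beta_1 x \le
\hat{B}_0 + \hat{B}_1 x + t_{(1+\gamma)/2}(n - 2)\sqrt{\widehat{\operatorname{var}}[\hat{M}(x)]}\right] = \gamma,
\]

and a $100\gamma$ percent confidence interval for $\beta_0 + \beta_1 x$ is obtained.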
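
The interval for $\mu(x)$ is narrowest at $x = \bar{x}$ and widens as x moves away from $\bar{x}$, because of the term $(x - \bar{x})^2$ in the estimated variance. The sketch below illustrates this with hypothetical data; the function name is an assumption of the example, and the t quantile is again supplied by the caller from a table.

```python
import math

def mean_response_interval(xs, ys, x0, t_quantile):
    """Interval for mu(x0) = beta0 + beta1*x0; t_quantile comes from a t table."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    sigma2_hat = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    # Estimated variance of the fitted mean: sigma2_hat*[1/n + (x0-xbar)^2/Sxx]
    se = math.sqrt(sigma2_hat * (1.0 / n + (x0 - xbar) ** 2 / sxx))
    fit = b0 + b1 * x0
    return fit - t_quantile * se, fit + t_quantile * se

# Illustrative data; 4.303 is the tabled 0.975 quantile of t with 2 d.f.
narrow = mean_response_interval([0, 1, 2, 3], [1, 3, 2, 4], 1.5, 4.303)
wide = mean_response_interval([0, 1, 2, 3], [1, 3, 2, 4], 3.0, 4.303)
```

Here `narrow` is evaluated at $x = \bar{x} = 1.5$ and `wide` at the edge of the design, so the second interval is the longer of the two.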


6 TESTS OF HYPOTHESES— CASE A 

In the linear model there are many tests that could be of interest to an
investigator. For example, the investigator may want to test whether the line goes
through the origin, i.e., to test if the intercept is equal to zero, or perhaps test
whether the intercept is positive (or negative). These are indicated by

$\mathscr{H}_0\colon \beta_0 = 0$ versus $\mathscr{H}_1\colon \beta_0 \neq 0$,

$\mathscr{H}_0\colon \beta_0 \ge 0$ versus $\mathscr{H}_1\colon \beta_0 < 0$,

$\mathscr{H}_0\colon \beta_0 \le 0$ versus $\mathscr{H}_1\colon \beta_0 > 0$.

These tests indicate that there is no interest in the slope $\beta_1$ or the variance $\sigma^2$.
On the other hand the interest may be in the slope rather than the intercept, and
an investigator could be interested in testing

$\mathscr{H}_0\colon \beta_1 = 0$ versus $\mathscr{H}_1\colon \beta_1 \neq 0$,

$\mathscr{H}_0\colon \beta_1 \le 0$ versus $\mathscr{H}_1\colon \beta_1 > 0$,

etc. Rather than testing whether the intercept (or slope) is equal to 0, an
investigator may be interested in testing whether it is equal to a given number. For
example the investigator may be interested in testing

$\mathscr{H}_0\colon \beta_0 = 2$ versus $\mathscr{H}_1\colon \beta_0 < 2$,

$\mathscr{H}_0\colon \beta_1 = 1$ versus $\mathscr{H}_1\colon \beta_1 \neq 1$,

etc. We shall derive a test of the hypothesis

$\mathscr{H}_0\colon \beta_1 = 0$ versus $\mathscr{H}_1\colon \beta_1 \neq 0$.

We could just as well derive a test of the hypothesis

$\mathscr{H}_0\colon \beta_1 \le 0$ versus $\mathscr{H}_1\colon \beta_1 > 0$

or a test of the hypothesis

$\mathscr{H}_0\colon \beta_1 \ge 0$ versus $\mathscr{H}_1\colon \beta_1 < 0$.

To test

$\mathscr{H}_0\colon \beta_1 = 0$ versus $\mathscr{H}_1\colon \beta_1 \neq 0$,

one obvious choice for a test statistic is

\[ T = \frac{\hat{B}_1}{\sqrt{\widehat{\operatorname{var}}[\hat{B}_1]}}. \]

Under $\mathscr{H}_0$ the random variable T is distributed as Student's t distribution with
n - 2 degrees of freedom. Thus a test procedure with size $\alpha$ is the following:

Reject $\mathscr{H}_0$ if and only if $|T| > t_{1-\alpha/2}(n - 2)$.

By comparing this with Eq. (16) we notice that this test is equivalent to the
procedure of setting a $1 - \alpha$ confidence interval on the parameter $\beta_1$ and rejecting
the hypothesis if and only if the confidence interval does not contain 0.

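We will now show that this test is a generalized likelihood-ratio test.
Corresponding to the notation in Chap. IX we note that in testing

$\mathscr{H}_0\colon \beta_1 = 0$ versus $\mathscr{H}_1\colon \beta_1 \neq 0$

the parameter spaces $\bar\Theta$, $\bar\Theta_0$, and $\bar\Theta_1$ are as given below, where $\theta = (\beta_0, \beta_1, \sigma^2)$:

\[
\begin{aligned}
\bar\Theta &= \{(\beta_0, \beta_1, \sigma^2)\colon -\infty < \beta_0 < \infty,\ -\infty < \beta_1 < \infty,\ \sigma^2 > 0\} \\
\bar\Theta_0 &= \{(\beta_0, \beta_1, \sigma^2)\colon -\infty < \beta_0 < \infty,\ \beta_1 = 0,\ \sigma^2 > 0\} \\
\bar\Theta_1 &= \bar\Theta - \bar\Theta_0.
\end{aligned}
\]

We must determine $\lambda$, where

\[
\lambda = \frac{\sup_{\theta \in \bar\Theta_0} L(\theta; y_1, \ldots, y_n)}{\sup_{\theta \in \bar\Theta} L(\theta; y_1, \ldots, y_n)}.
\tag{17}
\]

The likelihood function is

\[
L(\theta; y_1, \ldots, y_n) = L(\beta_0, \beta_1, \sigma^2)
= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2} \sum (y_i - \beta_0 - \beta_1 x_i)^2\right],
\tag{18}
\]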

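and the values of $\beta_0$, $\beta_1$, $\sigma^2$ that maximize this for $\theta \in \bar\Theta$ are the
maximum-likelihood estimates given in Eqs. (7) to (9). Thus we get

\[
\sup_{\theta \in \bar\Theta} L(\theta; y_1, \ldots, y_n)
= \frac{1}{(2\pi\tilde\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\tilde\sigma^2} \sum (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2\right]
= (2\pi\tilde\sigma^2)^{-n/2} e^{-n/2},
\]

where $\tilde\sigma^2 = \frac{1}{n} \sum (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$. To find $\sup_{\theta \in \bar\Theta_0} L(\theta; y_1, \ldots, y_n)$, we substitute
$\beta_1 = 0$ into Eq. (18) above and get

\[
L(\beta_0, 0, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2} \sum (y_i - \beta_0)^2\right].
\]

But this is the likelihood function for a random sample of size n from a normal
distribution with mean $\beta_0$ and variance $\sigma^2$. The values of $\beta_0$ and $\sigma^2$ that
maximize the likelihood function are the maximum-likelihood estimates

\[ \beta_0^* = \bar{y} \]

and

\[ \sigma^{*2} = \frac{1}{n} \sum (y_i - \bar{y})^2. \]

Thus

\[
\sup_{\theta \in \bar\Theta_0} L(\theta; y_1, \ldots, y_n)
= (\sigma^{*2}\,2\pi)^{-n/2} \exp\left[-\frac{1}{2\sigma^{*2}} \sum (y_i - \beta_0^*)^2\right]
= (\sigma^{*2}\,2\pi)^{-n/2} e^{-n/2}.
\]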



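We obtain

\[ \lambda = \left(\frac{\tilde\sigma^2}{\sigma^{*2}}\right)^{n/2} \]

for the generalized likelihood ratio. Instead of $\lambda$ we will examine the quantity
$(n - 2)(\lambda^{-2/n} - 1)$, which is a monotonic function of $\lambda$ and hence will give an
equivalent test function. We get

\[
\lambda^{-2/n} - 1 = \frac{\sigma^{*2} - \tilde\sigma^2}{\tilde\sigma^2}
= \frac{\sum (y_i - \bar{y})^2 - \sum (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}{\sum (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}.
\]

Replace $\hat\beta_0$ with $\bar{y} - \hat\beta_1 \bar{x}$ in the numerator, and get

\[
\lambda^{-2/n} - 1 = \frac{\hat\beta_1^2 \sum (x_i - \bar{x})^2}{\sum (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}.
\]

Hence,

\[
(n - 2)(\lambda^{-2/n} - 1) = \frac{\hat\beta_1^2 \sum (x_i - \bar{x})^2 / \sigma^2}{\hat\sigma^2/\sigma^2},
\]

which is the ratio of the values of two independent chi-square random variables
(under $\mathscr{H}_0\colon \beta_1 = 0$) divided by their respective degrees of freedom, which are
1 for the numerator and n - 2 for the denominator. Thus $(n - 2)(\lambda^{-2/n} - 1)$
has an F distribution with 1 and n - 2 degrees of freedom under $\mathscr{H}_0$. The
generalized likelihood-ratio test says to reject $\mathscr{H}_0$ if and only if $\lambda \le \lambda_0$, or if
and only if

\[ (n - 2)(\lambda^{-2/n} - 1) \ge (n - 2)(\lambda_0^{-2/n} - 1) = \lambda_0^* \text{ (say)}, \]

or if and only if

\[ \frac{\hat\beta_1^2 \sum (x_i - \bar{x})^2}{\hat\sigma^2} \ge \lambda_0^*, \]

where $\lambda_0^*$ is chosen for a desirable size of Type I error.

Note that $(n - 2)(\lambda^{-2/n} - 1)$ is the square of

\[ \frac{\hat\beta_1}{\sqrt{\widehat{\operatorname{var}}[\hat{B}_1]}}, \]

and recall that the square of a Student's t-distributed random variable with n - 2
degrees of freedom has an F distribution with 1 and n - 2 degrees of freedom.
Thus we have verified that if the confidence-interval statement in Eq. (16) is
used to test $\mathscr{H}_0\colon \beta_1 = 0$ versus $\mathscr{H}_1\colon \beta_1 \neq 0$, it is a generalized likelihood-ratio
test.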
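
The identity between $(n - 2)(\lambda^{-2/n} - 1)$ and the squared t statistic can be checked numerically. The sketch below computes both quantities from the same data; the function name `glr_and_t` and the data are hypothetical and serve only as an illustration of the algebra above.

```python
def glr_and_t(xs, ys):
    """Return (lambda, (n-2)(lambda^(-2/n) - 1), T^2) for testing beta1 = 0."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    # Ratio of the unrestricted to the restricted variance MLE, to the n/2.
    lam = (sse / syy) ** (n / 2)
    f_from_lambda = (n - 2) * (lam ** (-2 / n) - 1)
    # Squared t statistic: b1^2 * Sxx / sigma2_hat, with divisor n - 2.
    t_squared = b1 ** 2 * sxx / (sse / (n - 2))
    return lam, f_from_lambda, t_squared

lam, f_stat, t_sq = glr_and_t([0, 1, 2, 3], [1, 3, 2, 4])
```

For these illustrative values the residual sum of squares is 1.8 and $\sum (y_i - \bar{y})^2 = 5$, so $\lambda = (0.36)^2$, and both routes give $2(1/0.36 - 1) = 0.64 \cdot 5 / 0.9$. The sketch assumes the residual sum of squares is positive.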

We will generalize this result slightly in the following theorem.

Theorem 3 In the linear model given in Definition 1 the generalized
likelihood-ratio test of size $\alpha$ of $\mathscr{H}_0\colon \beta_1 = b_1$ ($b_1$ is a given constant) versus
$\mathscr{H}_1\colon \beta_1 \neq b_1$ is given by the following: Use Eq. (16) to set a $1 - \alpha$
confidence interval on $\beta_1$, and reject $\mathscr{H}_0$ if and only if the confidence interval
does not include $b_1$. ////

We shall state a theorem concerning a test of hypothesis on $\beta_0$, and the
proof will be asked for in Prob. 18.

Theorem 4 In the linear model given in Definition 1 the generalized
likelihood-ratio test of size $\alpha$ of $\mathscr{H}_0\colon \beta_0 = b_0$ ($b_0$ is a given constant) versus
$\mathscr{H}_1\colon \beta_0 \neq b_0$ is given by the following: Use Eq. (15) to set a $1 - \alpha$
confidence interval on $\beta_0$, and reject $\mathscr{H}_0$ if and only if the confidence
interval does not include $b_0$. ////

There are many other tests that are of interest for the linear model, and the
interested reader can consult Refs. 17, 29, 31, and 32.


7 POINT ESTIMATION— CASE B 

For this case $Y_1, Y_2, \ldots, Y_n$ are pairwise uncorrelated random variables with
means $\beta_0 + \beta_1 x_1, \beta_0 + \beta_1 x_2, \ldots, \beta_0 + \beta_1 x_n$ and variances $\sigma^2$. Since the joint
density of the $Y_i$ is not specified, maximum-likelihood estimators of $\beta_0$, $\beta_1$, and
$\sigma^2$ cannot be obtained. In models where the joint density of the observable
random variables is not given, a method of estimation called least squares can
be utilized.

Definition 2 Least squares Let $(Y_i, x_i)$, $i = 1, 2, \ldots, n$, be n pairs of
observations that satisfy the linear model given in Definition 1. The
values of $\beta_0$ and $\beta_1$ that minimize the sum of squares

\[ \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)^2 \]

are defined to be the least-squares estimators of $\beta_0$ and $\beta_1$. ////

To find the least-squares estimators of $\beta_0$ and $\beta_1$, we must find the values
that minimize

\[ S(\beta_0, \beta_1) = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)^2, \]

and clearly these are the same values that maximize the likelihood function in
Eq. (5). Hence we have the following theorem.


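Theorem 5 In Case B of the simple linear model given in Definition 1 the
least-squares estimators of $\beta_0$ and $\beta_1$ are given by $\hat{B}_0$ and $\hat{B}_1$, where

\[
\begin{aligned}
\hat{B}_1 &= \frac{\sum (Y_i - \bar{Y})(x_i - \bar{x})}{\sum (x_i - \bar{x})^2} \\
\hat{B}_0 &= \bar{Y} - \hat{B}_1 \bar{x}.
\end{aligned}
\tag{19}
\]

////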
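
The step from the sum of squares to Theorem 5 can be written out in two lines. The derivation below is a sketch that uses only quantities already defined in the text, with $S(\beta_0, \beta_1) = \sum (Y_i - \beta_0 - \beta_1 x_i)^2$ denoting the sum of squares of Definition 2. Setting the two partial derivatives of S equal to 0 gives

```latex
\begin{aligned}
\frac{\partial S}{\partial \beta_0} &= -2 \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i) = 0, \\
\frac{\partial S}{\partial \beta_1} &= -2 \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)\, x_i = 0 .
\end{aligned}
```

Apart from the factor $-2$, these are the two normal equations in Eq. (6), so their solution is the pair of estimators in Eqs. (7) and (8). The agreement with maximum likelihood arises because, for any fixed $\sigma^2$, the log likelihood in Eq. (5) depends on $\beta_0$ and $\beta_1$ only through $-S(\beta_0, \beta_1)/2\sigma^2$.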


The least-squares method gives no estimator for $\sigma^2$, but an estimator of
$\sigma^2$ based on the least-squares estimators of $\beta_0$ and $\beta_1$ is

\[ \hat\sigma^2 = \frac{1}{n-2} \sum (Y_i - \hat{B}_0 - \hat{B}_1 x_i)^2. \]

For Case A the maximum-likelihood estimators of $\beta_0$, $\beta_1$, and $\sigma^2$ had some
desirable optimum properties. The first corollary of Theorem 2 states that
$\hat{B}_0$ and $\hat{B}_1$ are uniformly minimum-variance unbiased estimators. That is, in the
class of all unbiased estimators of $\beta_0$ and $\beta_1$, the estimators $\hat{B}_0$ and $\hat{B}_1$ in Eq. (13)
have uniformly minimum variance. No such desirable property as this is
enjoyed by least-squares estimators for Case B. For Case A the assumptions
are much stronger than for Case B, where the distribution of the random
variables $Y_i$ is assumed to be unknown; so we should not expect as strong an
optimality in the estimators for Case B.

For Case B, we shall restrict our class of estimating functions and determine
if the least-squares estimators have any optimal properties in the restricted
class. Since $\mathscr{E}[Y_i] = \beta_0 + \beta_1 x_i$, we see that $\beta_0$ (and $\beta_1$) can be given by the
expected value of linear functions of the $Y_i$. Within this class of linear functions
we will define minimum-variance unbiased estimators.

Definition 3 Best linear unbiased estimators Let $Y_1, Y_2, \ldots, Y_n$ be
observable random variables such that $\mathscr{E}[Y_i] = \tau_i(\theta)$, where the $\tau_i(\cdot)$ are
known functions that contain unknown parameters $\theta$ ($\theta$ may be vector-valued).
To estimate any $\theta_j$ in $\theta$, consider only the class of estimators
that are linear functions of the random variables $Y_i$. In this class consider
only the subclass of estimators that are unbiased for $\theta_j$. If in this
restricted class an estimator of $\theta_j$ exists which has smaller variance than
any other estimator of $\theta_j$ in this restricted class, it is defined to be the best
linear unbiased estimator of $\theta_j$ ("best" refers to minimum variance). ////

It should be noted that there are two restrictions on the estimating functions
before the property of minimum variance is considered. First, the class of
estimating functions is restricted to linear functions of the $Y_i$. Second, in
the class of linear functions of the $Y_i$ only unbiased estimators are considered.
Finally, then, consideration is given to finding a minimum-variance estimator in
the class of estimating functions that are linear and unbiased.

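We will now prove an important theorem that gives optimum properties
for the point estimators of $\beta_0$ and $\beta_1$ derived by the method of least squares for
Case B. This theorem is often referred to as the Gauss-Markov theorem.

Theorem 6 Consider the linear model given in Definition 1, and let the
assumptions for Case B hold. Then the least-squares estimators for $\beta_1$
and $\beta_0$ given in Eq. (19) are the respective best linear unbiased estimators
for $\beta_1$ and $\beta_0$.

Proof We shall demonstrate the proof for $\beta_0$; the proof for $\beta_1$
is similar. Since we are restricting the class of estimators to be linear, we
have $\hat{B}_0 = \sum a_j Y_j$. We must determine the constants $a_j$ such that

(i) $\mathscr{E}[\hat{B}_0] = \beta_0$; that is, $\hat{B}_0$ is an unbiased estimator of $\beta_0$.
(ii) $\operatorname{var}[\hat{B}_0]$ is a minimum among all estimators satisfying (i).

For (i) we must have

\[
\beta_0 = \mathscr{E}[\hat{B}_0] = \mathscr{E}\left[\sum a_j Y_j\right] = \sum a_j \mathscr{E}[Y_j] = \sum a_j (\beta_0 + \beta_1 x_j).
\]

This gives the two equations which must be satisfied:

\[
\begin{aligned}
\sum a_j &= 1 \\
\sum a_j x_j &= 0.
\end{aligned}
\tag{20}
\]

Now

\[
\begin{aligned}
\operatorname{var}[\hat{B}_0] &= \mathscr{E}[(\hat{B}_0 - \beta_0)^2] = \mathscr{E}\left[\left(\sum a_j Y_j - \beta_0\right)^2\right] \\
&= \mathscr{E}\left\{\left[\sum a_j (\beta_0 + \beta_1 x_j + E_j) - \beta_0\right]^2\right\} \\
&= \mathscr{E}\left[\left(\beta_0 \sum a_j + \beta_1 \sum a_j x_j + \sum a_j E_j - \beta_0\right)^2\right].
\end{aligned}
\]

By the restrictions of Eq. (20),

\[
\operatorname{var}[\hat{B}_0] = \mathscr{E}\left[\left(\sum a_j E_j\right)^2\right]
= \sum a_j^2\,\mathscr{E}[E_j^2] + \sum_{i \neq j} \sum a_i a_j\,\mathscr{E}[E_i E_j].
\]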


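The quantity $\mathscr{E}[E_i E_j]$ is 0 if $i \neq j$ since, by assumption, the $E_i$ are
uncorrelated and have means 0. Hence

\[ \operatorname{var}[\hat{B}_0] = \sigma^2 \sum a_j^2. \]

Since $\sigma^2$ is a constant, to minimize $\operatorname{var}[\hat{B}_0]$ we need to minimize $\sum a_j^2$.
Thus constants $a_j$ must be found which minimize $\sum a_j^2$ subject to the
restrictions of Eq. (20). Using the theory of Lagrange multipliers, we must
minimize

\[ L = \sum a_j^2 - \lambda_1\left(\sum a_j - 1\right) - \lambda_2 \sum a_j x_j. \]

Taking derivatives, one finds

\[
\begin{aligned}
\frac{\partial L}{\partial a_t} &= 2a_t - \lambda_1 - \lambda_2 x_t = 0, \qquad t = 1, 2, \ldots, n \\
\frac{\partial L}{\partial \lambda_1} &= -\sum a_j + 1 = 0 \\
\frac{\partial L}{\partial \lambda_2} &= -\sum a_j x_j = 0.
\end{aligned}
\tag{21}
\]

If we sum over the first n equations, we get (using $\sum a_j = 1$)

\[ 2 = n\lambda_1 + \lambda_2 \sum x_j. \tag{22} \]

If we multiply the tth equation in (21) by $x_t$ and add, we get

\[ 2\sum a_t x_t = \lambda_1 \sum x_j + \lambda_2 \sum x_j^2, \]

or since $\sum a_j x_j = 0$, this becomes

\[ \lambda_1 \sum x_j + \lambda_2 \sum x_j^2 = 0. \tag{23} \]

If we solve this for $\lambda_1$ and substitute into (22), we get

\[
\lambda_2 = \frac{-2\sum x_j / n}{\sum x_j^2 - n\bar{x}^2} = \frac{-2\bar{x}}{\sum (x_j - \bar{x})^2}
\]

and

\[ \lambda_1 = \frac{2\sum x_j^2}{n\sum (x_j - \bar{x})^2}. \]

Substituting $\lambda_1$ and $\lambda_2$ into the tth equation in (21) and solving for $a_t$
gives

\[ a_t = \frac{\sum x_j^2 / n - \bar{x} x_t}{\sum (x_j - \bar{x})^2}. \]

The best linear unbiased estimator of $\beta_0$ is therefore

\[
\hat{B}_0 = \sum a_t Y_t = \frac{\bar{Y} \sum x_j^2 - \bar{x} \sum x_t Y_t}{\sum (x_j - \bar{x})^2} = \bar{Y} - \hat{B}_1 \bar{x},
\]

which is the one given by least squares, and so the proof is complete. A
similar proof holds for $\beta_1$. ////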
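
The weights $a_t$ obtained in the proof can be checked numerically against the two restrictions in Eq. (20) and against the least-squares intercept. The sketch below is only an illustration; the function name `blue_intercept_weights` and the data are hypothetical.

```python
def blue_intercept_weights(xs):
    """Weights a_t = (sum(x^2)/n - xbar*x_t) / Sxx from the proof of Theorem 6."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    mean_sq = sum(x * x for x in xs) / n
    return [(mean_sq - xbar * x) / sxx for x in xs]

xs = [0, 1, 2, 3]
ys = [1, 3, 2, 4]
a = blue_intercept_weights(xs)

# Linear estimator sum(a_t * Y_t); it should equal Ybar - B1_hat * xbar.
intercept = sum(at * y for at, y in zip(a, ys))
```

For this design the weights are 0.7, 0.4, 0.1, and -0.2; they sum to 1 and are orthogonal to the $x_t$, as Eq. (20) requires, and the resulting linear combination of the $Y_t$ is 1.3, the least-squares intercept for these data.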


PROBLEMS 

1 Assume that the data below satisfy the simple linear model given in Definition 1
for Case A.

y   -6.1   -0.5    7.2    6.9   -0.2   -2.1   -3.9    3.8

x   -2.0    0.6    1.4    1.3    0.0   -1.6   -1.7    0.7

Find the maximum-likelihood estimates of $\beta_0$, $\beta_1$, and $\sigma^2$.

2 In Prob. 1 find the UMVUE of $\beta_0 + 3\beta_1$.

3 In Prob. 1 find a 95 percent confidence interval on $\beta_0$, on $\beta_1$, on $\sigma^2$.

4 In Prob. 1 find a 90 percent confidence interval on $\mu(x)$ for x = -1.0.

5 In the simple linear model for Case A find the maximum-likelihood estimator of $\theta$,
where $\theta = \beta_0 + 3\beta_1 + 2\sigma^2$.

6 In Prob. 5 find the UMVUE of $\theta$.

7 In the simple linear model for Case A, show that p proportion of the distribution
of Y at $x = x_0$ is below $\xi_p$, where $\xi_p = \beta_0 + \beta_1 x_0 + z_p \sigma$ and $z_p$ is given by $\Phi(z_p) = p$.

8 In Prob. 7 find the UMVUE of $\xi_p$.

9 Use the data in Prob. 1 to evaluate the UMVUE of $\xi_p$ in Prob. 7.

10 The hardness Y of the shells of eggs laid by a certain breed of chickens was
assumed to be roughly linearly related to the amount x of a certain food supplement
put into the diet of the chickens. The model was assumed to be a simple linear
model for Case A. Data were collected and are given below.

y_i   .70   .98   1.16   1.75   .76   .82   .95   1.24   1.75   1.95

x_i   .12   .21   .34    .61    .13   .17   .21   .34    .62    .71

Test the hypothesis that $\beta_1 = 1.00$ versus the hypothesis $\beta_1 \neq 1.00$. Use a Type I
error probability of 5 percent.

11 In Prob. 10 test the hypothesis $\beta_1 > 1$ versus the hypothesis $\beta_1 \le 1$.

12 In Prob. 10 test the hypothesis $\mu(.50) > 1.5$ versus the hypothesis $\mu(.50) \le 1.5$.
Use a Type I error probability of 10 percent.

13 In Prob. 10 compute a 90 percent confidence interval on $2\sigma$.

14 In the simple linear model for Case A find the UMVUE of ft/o* 

15 Consider the simple linear model given in Definition 1 except $\operatorname{var}[Y_i] = a_i\sigma^2$,
where $a_i$, $i = 1, 2, \ldots, n$, are known positive numbers. Find the maximum-
likelihood estimators of $\beta_0$ and $\beta_1$.

16 What are the conditions on the $x_i$ in the simple linear model for Case A so that
$\hat{B}_0$ and $\hat{B}_1$ are independent?

17 In the simple linear model for Case A show that $\bar{Y}$ and $\hat{B}_1$ are uncorrelated. Are
they independent?

18 Prove Theorem 4.

19 In Theorem 6 give the proof for the best linear unbiased estimator of $\beta_1$.

20 For the simple linear model for Case B prove that the best (minimum-variance)
linear unbiased estimator of $\beta_0 + \beta_1$ is $\hat{B}_0 + \hat{B}_1$, where $\hat{B}_0$ and $\hat{B}_1$ are the least-
squares estimators of $\beta_0$ and $\beta_1$, respectively.

21 Extend Prob. 20 to $c_0\beta_0 + c_1\beta_1$, where $c_0$ and $c_1$ are given constants.



XI 

NONPARAMETRIC METHODS 


1 INTRODUCTION AND SUMMARY 

The important place ascribed to the normal distribution in statistical theory is
well justified on the basis of the central-limit theorem. However, often it is
not known whether the basic distribution is such that the central-limit theorem
applies or whether the approximation to the normal distribution is good enough
that the resulting confidence intervals and tests of hypotheses based on normal
theory are as accurate as desired. For example, if a random sample of size n
is taken from a population with a normal density and a 95 percent confidence
interval is set about the mean (see Sec. 3.1 of Chap. VIII), then the frequency
interpretation is the following: If repeated random samples are taken from this
population and if a 95 percent confidence interval is obtained for each random
sample, in the long run 95 percent of these intervals will contain the mean of the
density. If sampling is from a density that is not normal, then, instead of 95
percent of the intervals containing the mean, it may be 99 or 90 percent, or some
other percentage. If it is close to 95 percent, say 93 to 97 percent, usually the
experimenter will be satisfied. However, if it deviates a large amount from the
desired percentage, then the experimenter will probably not be satisfied. In cases where
it is known that the conventional methods based on the assumption of a normal 
density are not applicable, an alternative method is desired. If the basic distri- 
bution is known (but is not necessarily normal), one may be able to derive exact 
(or sufficiently accurate) tests of hypotheses and confidence intervals based on 
that distribution. In many cases an experimenter does not know the form of the 
basic distribution and needs statistical techniques which are applicable regardless 
of the form of the density. These techniques are called nonparametric or 
distribution-free methods. 

The term “nonparametric” arises from considerations of testing hypotheses
(Chap. IX). In forming the generalized likelihood ratio, for example, one
deals with a parameter space which defines a family of distributions as the para- 
meters in the functional form of the distribution vary over the parameter space. 
The methods to be developed in this chapter make no use of functional forms 
or parameters of such forms. They apply to very wide families of distributions 
rather than only to families specified by a particular functional form. The term 
“distribution-free” is also often used to indicate similarly that the methods do 
not depend on the functional form of distribution functions. 

The nonparametric methods that will be considered will, for the most part, 
be based on the order statistics. Also, although the methods to be presented are 
applicable to both continuous and discrete random variables, we shall direct our 
attention almost entirely to the continuous case. 

Section 2 will be devoted to considerations of statistical inferences that 
concern the cumulative distribution function of the population to be sampled. 
The sample cumulative distribution function will be used in three types of 
inference, namely, point estimation, interval estimation, and testing. Popula- 
tion quantiles have been defined for any distribution function regardless of the 
form of that distribution. Section 3 deals with distribution-free statistical 
methods of making inferences regarding population quantiles. Section 4 studies 
an important concept, that of tolerance limits . The similarities and differences 
of tolerance limits and confidence limits are noted. 

In Sec. 5 we return to an important problem in the application of the 
theory of statistics. It is the problem of testing the homogeneity of two popula- 
tions. This problem was first mentioned in Subsec. 4.3 of Chap. IX when we 
tested the equality of the means of two normal populations. It was considered 
again in Subsec. 5.3 of Chap. IX when we tested the equality of two multi- 
nomial populations. We indicated there that the derived test using a chi-square- 
type statistic could be used to test the equality of two arbitrary populations, and 
so we had really anticipated this chapter inasmuch as we derived a distribution- 
free test. Other distribution-free tests of the homogeneity of two populations
will be presented in Sec. 5. Included will be the sign test, the run test, the
median test, and the rank-sum test.

In this chapter we present only a very brief introduction to nonparametric
statistical methods. This chapter is similar to the last inasmuch as it includes
use of the three basic kinds of inference that were the focus of our attention in
Chaps. VII to IX. We shall see that much of the required distributional theory
is elementary, seldom using anything more complicated than the basic principles
of probability that were considered in Chap. I and the binomial distribution.

2 INFERENCES CONCERNING A 

CUMULATIVE DISTRIBUTION FUNCTION 

2.1 Sample or Empirical Cumulative Distribution Function

In Subsec. 5.4 of Chap. VI, we defined the sample cumulative distribution
function (c.d.f.). We indicated there that it could be used to estimate the
cumulative distribution function from which we sampled. In this subsection some
results about the sample c.d.f. will be reviewed and used to formulate point
estimates. In the two following subsections the sample c.d.f. will be utilized to
test a hypothesis (in Subsec. 2.2) and to set a confidence interval (in Subsec. 2.3).
Recall that (see Definition 13 in Chap. VI) the sample c.d.f. is defined by

\[ F_n(x) = \frac{1}{n} \, (\text{number of } X_i \text{ less than or equal to } x), \tag{1} \]

where $X_1, \ldots, X_n$ is a random sample from some c.d.f. $F(\cdot)$. According to
Theorem 17 of Chap. VI,

\[ P\Bigl[ F_n(x) = \frac{k}{n} \Bigr] = \binom{n}{k} [F(x)]^k [1 - F(x)]^{n-k}, \qquad k = 0, 1, \ldots, n, \tag{2} \]

where $F_n(\cdot)$ is the sample c.d.f. corresponding to c.d.f. $F(\cdot)$. From Eq. (2),
we see that

\[ \mathscr{E}[F_n(x)] = \sum_{k=0}^{n} \frac{k}{n} \binom{n}{k} [F(x)]^k [1 - F(x)]^{n-k} = F(x) \tag{3} \]

and similarly

\[ \operatorname{var}[F_n(x)] = \frac{1}{n} F(x)[1 - F(x)]. \tag{4} \]

In fact, since $F_n(x)$ is the sample mean of the random variables
$I_{(-\infty,x]}(X_1), \ldots, I_{(-\infty,x]}(X_n)$, we know by the central-limit theorem that
$F_n(x)$ is asymptotically normally distributed with mean $F(x)$ and variance
$(1/n)F(x)[1 - F(x)]$.
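The definition in Eq. (1) and the moments in Eqs. (3) and (4) can be illustrated with a short sketch. The function names and the five observations below are hypothetical; the second function simply sums over the binomial probabilities of Eq. (2) for an assumed value of $F(x)$, so it is a numerical check rather than a proof.

```python
import bisect
from math import comb


def sample_cdf(sample):
    """Return F_n(.), the sample c.d.f. of Eq. (1)."""
    data = sorted(sample)
    n = len(data)

    def F_n(x):
        # bisect_right gives the number of observations less than or equal to x
        return bisect.bisect_right(data, x) / n

    return F_n


def moments_of_sample_cdf(n, F):
    """Mean and variance of F_n(x) from the pmf in Eq. (2), where F = F(x)."""
    pmf = [comb(n, k) * F ** k * (1 - F) ** (n - k) for k in range(n + 1)]
    mean = sum((k / n) * p for k, p in enumerate(pmf))
    second = sum((k / n) ** 2 * p for k, p in enumerate(pmf))
    return mean, second - mean ** 2


F_n = sample_cdf([3.1, 0.4, 2.2, 2.2, 5.0])   # hypothetical observations
print(F_n(0.0), F_n(2.2), F_n(5.0))

mean, var = moments_of_sample_cdf(n=20, F=0.3)
print(mean, var)   # compare with F(x) and F(x)[1 - F(x)]/n
```

Sorting once and using a binary search keeps each evaluation of $F_n(x)$ cheap, which matters when the sample c.d.f. is evaluated at many points.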

Equations (3) and (4) show that for fixed x, $F_n(x)$ is an unbiased and
mean-squared-error consistent estimator of $F(x)$, regardless of the form of $F(\cdot)$.
If one is interested in estimating $F(x)$ for every x (rather than for a fixed x),
then one is interested in saying something about how close $F_n(x)$ is to $F(x)$
jointly over all values x; hence the following result is of interest:

\[ P\Bigl[ \sup_{-\infty < x < \infty} |F_n(x) - F(x)| \xrightarrow[n \to \infty]{} 0 \Bigr] = 1. \tag{5} \]

Equation (5), known as the Glivenko-Cantelli theorem, states that with
probability one the convergence of $F_n(x)$ to $F(x)$ is uniform in x. We can define

\[ D_n = \sup_{-\infty < x < \infty} |F_n(x) - F(x)|. \tag{6} \]

$D_n$ is a random quantity that measures how far $F_n(\cdot)$ deviates from $F(\cdot)$.
Equation (5) states that $P[\lim_{n \to \infty} D_n = 0] = 1$; so, in particular, the c.d.f. of $D_n$,
say $F_{D_n}(\cdot)$, converges to the discrete c.d.f. that has all its mass at 0. In the next
subsection we will consider the limiting distribution of $\sqrt{n}\,D_n$. Equation (5)
tells us that the estimating function $F_n(x)$ of the c.d.f. $F(x)$ converges to $F(x)$
uniformly for all x with probability one.

Instead of a point estimate of $F(x) = P[X \le x]$, one might be interested
in a point estimate of $F(y) - F(x) = P[x < X \le y]$ for fixed $x < y$. The following
remark is useful in showing that $F_n(y) - F_n(x)$ is an unbiased mean-squared-
error consistent estimator of $F(y) - F(x)$.

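Remark

\[ \operatorname{cov}[F_n(x), F_n(y)] = \frac{1}{n} F(x)[1 - F(y)] \qquad \text{for } y \ge x. \tag{7} \]

PROOF

\[
\begin{aligned}
\operatorname{cov}[F_n(x), F_n(y)]
 &= \operatorname{cov}\Bigl[ \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty,x]}(X_i), \; \frac{1}{n} \sum_{j=1}^{n} I_{(-\infty,y]}(X_j) \Bigr] \\
 &= \Bigl(\frac{1}{n}\Bigr)^2 \sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{cov}[I_{(-\infty,x]}(X_i), I_{(-\infty,y]}(X_j)] \\
 &= \Bigl(\frac{1}{n}\Bigr)^2 \sum_{i=1}^{n} \operatorname{cov}[I_{(-\infty,x]}(X_i), I_{(-\infty,y]}(X_i)] \\
 &= \frac{1}{n} \operatorname{cov}[I_{(-\infty,x]}(X_1), I_{(-\infty,y]}(X_1)] \\
 &= \frac{1}{n} [F(x) - F(x)F(y)] \\
 &= \frac{1}{n} F(x)[1 - F(y)]. \qquad ////
\end{aligned}
\]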

Using Eq. (7), one sees immediately that

\[
\begin{aligned}
\operatorname{var}[F_n(y) - F_n(x)] &= \operatorname{var}[F_n(y)] - 2 \operatorname{cov}[F_n(x), F_n(y)] + \operatorname{var}[F_n(x)] \\
 &= \frac{1}{n} [F(y) - F(x)][1 - F(y) + F(x)];
\end{aligned}
\]

mean-squared-error consistency of $F_n(y) - F_n(x)$ as an estimator of $F(y) - F(x)$
follows immediately.

Rather than estimating $P[x < X \le y]$, i.e., the probability that X falls
in some interval, one might consider estimating $P[X \in B]$, i.e., the probability
that X falls in some set B. It can be shown that (see the Problems)
$(1/n) \sum_{i=1}^{n} I_B(X_i)$ is an unbiased estimator of $P[X \in B]$ and

\[ \operatorname{var}\Bigl[ \frac{1}{n} \sum_{i=1}^{n} I_B(X_i) \Bigr] = \frac{1}{n} P[X \in B](1 - P[X \in B]); \]

hence, it is mean-squared-error consistent.


2.2 Kolmogorov-Smirnov Goodness-of-fit Test

We noted above that $F_n(x)$ has an asymptotic normal distribution. Equivalently,
$\sqrt{n}\,[F_n(x) - F(x)]$ has a limiting normal distribution with mean 0 and variance
$F(x)[1 - F(x)]$. We now state (without proof) a result that gives the limiting
distribution of

\[ \sqrt{n}\,D_n = \sqrt{n} \sup_{-\infty < x < \infty} |F_n(x) - F(x)|. \]

Theorem 1 Let $X_1, \ldots, X_n, \ldots$ be independent identically distributed
random variables having common continuous c.d.f. $F_X(\cdot) = F(\cdot)$.
Define

\[ D_n = d_n(X_1, \ldots, X_n) = \sup_{-\infty < x < \infty} |F_n(x) - F(x)|, \]

where $F_n(x)$ is the sample c.d.f. Then

\[
\lim_{n \to \infty} F_{\sqrt{n} D_n}(x) = \lim_{n \to \infty} P[\sqrt{n}\,D_n \le x]
 = \Bigl[ 1 - 2 \sum_{j=1}^{\infty} (-1)^{j-1} e^{-2j^2x^2} \Bigr] I_{(0,\infty)}(x) = H(x), \text{ say.} \tag{8}
\]
////

The c.d.f. given in Eq. (8) does not depend on the c.d.f. from which the
sample was drawn (other than that it be continuous); that is, the limiting
distribution of $\sqrt{n}\,D_n$ is distribution-free. This fact allows $D_n$ to be broadly
used as a test statistic for goodness of fit. For instance, suppose one wishes to
test that the distribution that is being sampled from is some specified continuous
distribution; that is, test $\mathscr{H}_0$: $X_i \sim F_0(\cdot)$, where $F_0(\cdot)$ is some completely
specified continuous c.d.f. If $\mathscr{H}_0$ is true,

\[ K_n = k_n(X_1, \ldots, X_n) = \sqrt{n} \sup_{-\infty < x < \infty} |F_n(x) - F_0(x)| \tag{9} \]

is approximately distributed as $H(\cdot)$, the c.d.f. given in Eq. (8). If $\mathscr{H}_0$ is false,
then $F_n(\cdot)$ will tend to be near the true c.d.f. $F(\cdot)$ and not near $F_0(\cdot)$, and
consequently $\sup_{-\infty < x < \infty} |F_n(x) - F_0(x)|$ will tend to be large; hence a reasonable
test criterion is to reject $\mathscr{H}_0$ if $\sup_{-\infty < x < \infty} |F_n(x) - F_0(x)|$ is large. Since
$K_n = \sqrt{n} \sup_{-\infty < x < \infty} |F_n(x) - F_0(x)|$ is approximately distributed as $H(\cdot)$ when
$\mathscr{H}_0$ is true and $H(\cdot)$ has been tabulated, $k_{1-\alpha}$ can be determined so that
$H(k_{1-\alpha}) = 1 - \alpha$, and hence $P[K_n > k_{1-\alpha}] \approx \alpha$. That is, the test defined by
“Reject $\mathscr{H}_0$ if and only if $K_n > k_{1-\alpha}$” has approximate size $\alpha$. Such a test is
often labeled the Kolmogorov-Smirnov goodness-of-fit test. It tests how well a
given set of observations fits some specified c.d.f. $F_0(\cdot)$. The fit is measured
by the so-called Kolmogorov statistic $\sup_{-\infty < x < \infty} |F_n(x) - F_0(x)|$. Theorem 1 gives
an asymptotic distribution for $\sqrt{n}\,D_n$. The exact distribution of $D_n$ has been
tabled for various n. See Ref. 44.
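When a table of the limiting c.d.f. is not at hand, the series in Eq. (8) can be summed directly. The sketch below is one way to do so; the function names, the truncation at 100 terms, and the bisection bracket are assumptions made for illustration, and the truncated series is adequate only for arguments that are not very close to zero.

```python
import math


def kolmogorov_cdf(x, terms=100):
    """H(x) of Eq. (8): 1 - 2 * sum_{j>=1} (-1)^(j-1) exp(-2 j^2 x^2) for x > 0."""
    if x <= 0:
        return 0.0
    s = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * x * x)
            for j in range(1, terms + 1))
    return 1.0 - 2.0 * s


def kolmogorov_critical_value(gamma, lo=0.3, hi=3.0, tol=1e-8):
    """Solve H(k) = gamma by bisection; H is increasing on (0, infinity)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if kolmogorov_cdf(mid) < gamma:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


for gamma in (0.99, 0.95, 0.90):
    print(gamma, round(kolmogorov_critical_value(gamma), 3))
```

A test of approximate size $\alpha$ then rejects when $K_n$ exceeds `kolmogorov_critical_value(1 - alpha)`.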


EXAMPLE 1 A question of at least curious interest is the following: Are
the times of birth uniformly distributed over the hours of the day? For
37 consecutive births (actual data) the following times were observed:
7:02 p.m., 11:08 p.m., 3:56 a.m., 8:12 a.m., 8:40 a.m., 12:25 p.m., 1:24 a.m.,
8:25 a.m., 2:02 p.m., 11:46 p.m., 10:07 a.m., 1:53 p.m., 6:45 p.m., 9:06 a.m.,
3:57 p.m., 7:40 a.m., 3:02 a.m., 10:45 a.m., 3:06 p.m., 6:26 a.m., 4:44 p.m.,
12:26 a.m., 2:17 p.m., 11:45 p.m., 5:08 a.m., 5:49 a.m., 6:32 a.m., 12:40 p.m.,
1:30 p.m., 12:55 p.m., 3:22 p.m., 4:09 p.m., 7:46 p.m., 2:28 a.m., 10:06 a.m.,
11:19 a.m., 4:31 p.m. Both the hypothesized uniform c.d.f. and the sample
c.d.f. are sketched in Fig. 1.

One can calculate

\[ \sqrt{n} \sup |F_n(x) - F(x)| = \sqrt{37}\, \Bigl| \frac{31}{37} - \frac{1004}{1440} \Bigr| \approx .85, \]

the supremum being attained at 4:44 p.m. (1004 minutes after midnight), by
which time 31 of the 37 births had occurred.
The critical value for size $\alpha = .10$ is greater than 1.22, so, according
to the Kolmogorov-Smirnov goodness-of-fit test, the data do not indicate
that the hypothesis that times of birth are uniformly distributed throughout
the hours of the day should be rejected. ////
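The statistic of Example 1 can be recomputed from the 37 birth times. The sketch below is illustrative: the helper names are hypothetical, and times are converted to minutes after midnight so that the hypothesized uniform c.d.f. is $F_0(t) = t/1440$. Because the sample c.d.f. is a step function, the supremum is found by comparing $F_0$ with the value of $F_n$ just before and just after each ordered observation.

```python
import math

BIRTH_TIMES = [
    "7:02 pm", "11:08 pm", "3:56 am", "8:12 am", "8:40 am", "12:25 pm", "1:24 am",
    "8:25 am", "2:02 pm", "11:46 pm", "10:07 am", "1:53 pm", "6:45 pm", "9:06 am",
    "3:57 pm", "7:40 am", "3:02 am", "10:45 am", "3:06 pm", "6:26 am", "4:44 pm",
    "12:26 am", "2:17 pm", "11:45 pm", "5:08 am", "5:49 am", "6:32 am", "12:40 pm",
    "1:30 pm", "12:55 pm", "3:22 pm", "4:09 pm", "7:46 pm", "2:28 am", "10:06 am",
    "11:19 am", "4:31 pm",
]


def minutes_after_midnight(label):
    """Convert a label such as '7:02 pm' to minutes after midnight."""
    clock, half = label.split()
    hour, minute = (int(part) for part in clock.split(":"))
    hour %= 12                 # 12 a.m. is hour 0; 12 p.m. becomes 12 after the shift
    if half == "pm":
        hour += 12
    return 60 * hour + minute


def kolmogorov_statistic(sample, cdf0):
    """sup_x |F_n(x) - F_0(x)| for a continuous hypothesized c.d.f. cdf0."""
    y = sorted(sample)
    n = len(y)
    d = 0.0
    for i, value in enumerate(y, start=1):
        f0 = cdf0(value)
        # F_n jumps from (i - 1)/n to i/n at the ith ordered observation
        d = max(d, i / n - f0, f0 - (i - 1) / n)
    return d


minutes = [minutes_after_midnight(t) for t in BIRTH_TIMES]
d_n = kolmogorov_statistic(minutes, lambda t: t / 1440.0)
k_n = math.sqrt(len(minutes)) * d_n
print(len(minutes), round(d_n, 4), round(k_n, 2))
```

The computed value of $\sqrt{n}\,D_n$ is well below 1.22, in agreement with the conclusion of the example.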

The Kolmogorov-Smirnov goodness-of-fit test assumed that the null
hypothesis was simple; that is, the null hypothesis completely specified (no
unknown parameters) the distribution of the population. One might inquire
as to whether such a goodness-of-fit testing procedure can be extended to a
composite null hypothesis which states that the distribution of the population
belongs to some parametric family of distributions, say $\{F(\cdot\,; \theta): \theta \in \overline{\Theta}\}$. For
such null hypotheses $\sup |F_n(x) - F(x; \theta)|$ is no longer a statistic since it depends
on an unknown parameter $\theta$. An obvious way of removing the dependence on
$\theta$ is to replace $\theta$ by an estimator, say $\hat{\theta}$, similar to what was done in the classical
chi-square goodness-of-fit test. The test statistic then becomes
$\sup |F_n(x) - F(x; \hat{\theta})|$. The distribution of such a test statistic is not known and, in general,
depends on the hypothesized parametric family. Although some studies (often
Monte Carlo) have been reported in the literature, much remains to be done
before a Kolmogorov-Smirnov goodness-of-fit test for composite hypotheses
becomes a practical testing tool.

2.3 Confidence Bands for Cumulative Distribution Function

Theorem 1 can also be used to set confidence bands on the c.d.f. $F(\cdot)$ sampled
from. Let $k_\gamma$ be defined by $H(k_\gamma) = \gamma$, where $H(\cdot)$ is the c.d.f. in Eq. (8). A
brief table of $\gamma$ and $k_\gamma$ is

    $\gamma$      .99    .95    .90    .85    .80
    $k_\gamma$   1.63   1.36   1.22   1.14   1.07

It follows that

\[ P\bigl[ \sqrt{n} \sup_x |F_n(x) - F(x)| \le k_\gamma \bigr] \approx \gamma, \]

but

\[
\begin{aligned}
P\bigl[ \sqrt{n} \sup_x |F_n(x) - F(x)| \le k_\gamma \bigr]
 &= P\Bigl[ \sup_x |F_n(x) - F(x)| \le \frac{k_\gamma}{\sqrt{n}} \Bigr] \\
 &= P\Bigl[ F_n(x) - \frac{k_\gamma}{\sqrt{n}} \le F(x) \le F_n(x) + \frac{k_\gamma}{\sqrt{n}} \text{ for all } x \Bigr],
\end{aligned}
\]

noting that

\[ \sup_x |F_n(x) - F(x)| \le \frac{k_\gamma}{\sqrt{n}} \]

if and only if

\[ F_n(x) - \frac{k_\gamma}{\sqrt{n}} \le F(x) \le F_n(x) + \frac{k_\gamma}{\sqrt{n}} \]

for all x. Using the fact that $0 \le F(x) \le 1$, we have

\[ P\Bigl[ \max\Bigl[ 0, F_n(x) - \frac{k_\gamma}{\sqrt{n}} \Bigr] \le F(x) \le \min\Bigl[ F_n(x) + \frac{k_\gamma}{\sqrt{n}}, 1 \Bigr] \text{ for all } x \Bigr] \approx \gamma; \tag{10} \]

that is, the band with lower boundary defined by $L(x) = \max[0, F_n(x) - k_\gamma/\sqrt{n}]$
and upper boundary defined by $U(x) = \min[F_n(x) + k_\gamma/\sqrt{n}, 1]$ is an approximate
$100\gamma$ percent confidence band for the c.d.f. $F(\cdot)$, where the meaning of the
confidence band is given in Eq. (10).

3 INFERENCES CONCERNING QUANTILES

3.1 Point and Interval Estimates of a Quantile

Throughout this section we will assume that we are sampling from a continuous
c.d.f., say $F(\cdot)$. Recall (see Definition 17 of Chap. II) that the qth quantile of
c.d.f. $F(\cdot)$, denoted by $\xi_q$, is defined by $F(\xi_q) = q$ for fixed q, $0 < q < 1$. In
particular, for $q = \tfrac{1}{2}$, $\xi_{1/2}$ is called the median. We saw in Subsec. 4.6 of Chap. II
that quantiles can be used to measure location and dispersion of a c.d.f. For
instance, $(\xi_p + \xi_{1-p})/2$, etc., are measures of location, and
$\xi_{.75} - \xi_{.25}$, etc., are measures of dispersion.

In Subsec. 2.1, we considered estimating $F(x)$ for fixed x; now, we consider
estimating $\xi_q$ such that $F(\xi_q) = q$ for fixed q. We know that if X is a
continuous random variable with c.d.f. $F(\cdot)$, then the random variable $F(X)$
has a uniform distribution (see Theorem 12 in Chap. V) over the interval (0, 1).
Hence $F(Y_j)$ has the same distribution as the jth order statistic from a uniform
distribution, and we know that $\mathscr{E}[F(Y_j)] = j/(n + 1)$. (As usual, $Y_1, \ldots, Y_n$
are the order statistics corresponding to the random sample $X_1, \ldots, X_n$.)
Consequently we might estimate $\xi_q$ with $Y_j$ if $q \approx j/(n + 1)$. [If $j/(n + 1) < q <
(j + 1)/(n + 1)$, one could estimate $\xi_q$ by interpolating between the order statistics
$Y_j$ and $Y_{j+1}$.]
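The point estimate just described can be sketched as follows. The text does not fix an interpolation rule, so linear interpolation is an assumption here, as is the choice to return the smallest or largest observation when $q(n + 1)$ falls outside the range 1 to n; the function name and the data are hypothetical.

```python
import math


def quantile_estimate(sample, q):
    """Estimate the qth quantile from the order statistics.

    Uses Y_j when q = j/(n + 1) and (an assumption) linear interpolation
    between Y_j and Y_{j+1} when j/(n + 1) < q < (j + 1)/(n + 1).
    """
    y = sorted(sample)          # y[j - 1] is the jth order statistic Y_j
    n = len(y)
    position = q * (n + 1)      # the (possibly fractional) index j
    j = math.floor(position)
    if j < 1:                   # assumption: clamp to the smallest observation
        return y[0]
    if j >= n:                  # assumption: clamp to the largest observation
        return y[-1]
    fraction = position - j
    return y[j - 1] + fraction * (y[j] - y[j - 1])


data = [4.0, 9.0, 1.0, 7.0, 3.0, 8.0, 2.0, 6.0, 5.0]   # hypothetical sample, n = 9
print(quantile_estimate(data, 0.50))   # q = 5/10, so the estimate is Y_5
print(quantile_estimate(data, 0.25))   # q lies between 2/10 and 3/10
```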

A confidence interval estimate of $\xi_q$ can be obtained by using two order
statistics, the interval between them constituting the confidence interval. We
are interested in computing the confidence coefficient for a pair of order statistics.

\[
\begin{aligned}
P[Y_j \le \xi_q \le Y_k] &= P[F(Y_j) \le F(\xi_q) = q \le F(Y_k)] \\
 &= 1 - P[F(Y_j) > q] - P[F(Y_k) < q] \\
 &= P[F(Y_j) \le q] - P[F(Y_k) < q].
\end{aligned}
\]

Recall that

\[ f_{Y_j}(y) = \frac{n!}{(j-1)!\,(n-j)!} [F(y)]^{j-1} [1 - F(y)]^{n-j} f(y); \]

hence for $Z = F(Y_j)$,

\[ f_Z(z) = \Bigl| \frac{dy}{dz} \Bigr| f_{Y_j}(y) = \frac{n!}{(j-1)!\,(n-j)!} \, z^{j-1} (1 - z)^{n-j} I_{(0,1)}(z). \]

Thus,

\[
\begin{aligned}
P[F(Y_j) \le u] &= \int_0^u f_Z(z)\,dz \\
 &= \frac{1}{B(j, n - j + 1)} \int_0^u z^{j-1} (1 - z)^{n-j+1-1}\,dz \\
 &= IB_u(j, n - j + 1),
\end{aligned}
\]

called the incomplete beta function, which is extensively tabulated. Hence,

\[ P[Y_j \le \xi_q \le Y_k] = IB_q(j, n - j + 1) - IB_q(k, n - k + 1), \]

which is the confidence coefficient of the interval $(Y_j, Y_k)$. In practice, of course,
we are interested in going in the other direction; that is, for fixed $\gamma$ pick j and k
(and consequently order statistics $Y_j$ and $Y_k$) such that

\[ IB_q(j, n - j + 1) - IB_q(k, n - k + 1) \approx \gamma, \]

and then $(Y_j, Y_k)$ is a $100\gamma$ percent confidence interval for $\xi_q$. Of course, for
arbitrary $\gamma$ there will not exist a j and k so that the confidence coefficient is
exactly $\gamma$.

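The confidence coefficient can be obtained another way.

\[ P[Y_j \le \xi_q \le Y_k] = P[Y_j \le \xi_q] - P[Y_k < \xi_q]. \]

But

\[
\begin{aligned}
P[Y_j \le \xi_q] &= P[j\text{th order statistic} \le \xi_q] \\
 &= P[j \text{ or more observations} \le \xi_q] \\
 &= \sum_{i=j}^{n} P[\text{exactly } i \text{ observations} \le \xi_q] \\
 &= \sum_{i=j}^{n} \binom{n}{i} q^i (1 - q)^{n-i};
\end{aligned}
\]

hence,

\[
P[Y_j \le \xi_q \le Y_k] = \sum_{i=j}^{n} \binom{n}{i} q^i (1 - q)^{n-i} - \sum_{i=k}^{n} \binom{n}{i} q^i (1 - q)^{n-i}
 = \sum_{i=j}^{k-1} \binom{n}{i} q^i (1 - q)^{n-i}.
\]

Note that a table of the binomial distribution can now be used to evaluate the
confidence coefficient.

EXAMPLE 2 For a sample of size 10, what is the confidence coefficient of
the interval $(Y_2, Y_9)$, which is a confidence interval estimator of the
population median? We have

\[ P[Y_2 \le \xi_{.50} \le Y_9] = \sum_{i=2}^{8} \binom{10}{i} \Bigl(\frac{1}{2}\Bigr)^{10} = \frac{1002}{1024} \approx .9785. \qquad //// \]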
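The binomial form of the confidence coefficient is easy to evaluate without tables. The following sketch sums the binomial probabilities for i = j, ..., k - 1; the function name is hypothetical, and the continuity of the sampled distribution is assumed, as in the text.

```python
from math import comb


def quantile_ci_coefficient(n, j, k, q):
    """P[Y_j <= xi_q <= Y_k] = sum_{i=j}^{k-1} C(n, i) q^i (1 - q)^(n - i)."""
    return sum(comb(n, i) * q ** i * (1 - q) ** (n - i) for i in range(j, k))


# Example 2: n = 10 and the interval (Y_2, Y_9) for the median
print(quantile_ci_coefficient(10, 2, 9, 0.5))

# Scanning symmetric pairs shows how few confidence levels are attainable
for j in range(1, 6):
    print(j, 11 - j, round(quantile_ci_coefficient(10, j, 11 - j, 0.5), 4))
```

The scan over symmetric pairs $(Y_j, Y_{n+1-j})$ makes the remark about the scarcity of attainable confidence levels concrete: for n = 10 only five such levels exist.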

We have presented one way, using order statistics, of obtaining point
estimates or confidence-interval estimates for a quantile.

Besides being extremely general in that the method requires few assumptions
about the form of the distribution function, the method is extraordinarily
simple. No complex analysis or distribution theory was needed; the simple
binomial distribution provided the necessary equipment to determine the
confidence coefficient. The only inconvenience was the paucity of confidence levels
that could be attained.

3.2 Tests of Hypotheses Concerning Quantiles

Let $X_1, \ldots, X_n$ denote a random sample from a probability density function,
say $f(\cdot)$. Suppose that it is desired to test that the qth quantile of the
population sampled from is a specified value, say $\xi$. That is, it is desired to test

\[ \mathscr{H}_0: \xi_q = \xi \quad \text{versus} \quad \mathscr{H}_1: \xi_q \ne \xi, \]

where $f(\cdot)$ is unspecified (other than being a probability density function).
The confidence-interval method of deriving a test (see Subsec. 3.4 of Chap. IX)
can be used; for instance, obtain a $100\gamma$ percent confidence interval for $\xi_q$, and
accept $\mathscr{H}_0$ if and only if the derived confidence interval contains $\xi$. Such a test
has size $1 - \gamma$.

An alternative test is the so-called one-sample sign test. It is a very simple
test based on the value of a statistic that represents the number of the n
transformed observations that have a positive sign. To illustrate the principle
involved in the sign test, consider testing $\mathscr{H}_0: \xi_q = \xi$ versus $\mathscr{H}_1: \xi_q \ne \xi$ for a
random sample $X_1, \ldots, X_n$ from some unspecified probability density function
$f(\cdot)$. Let Z denote the number of $X_i$ that exceed $\xi$. Equivalently, Z is the
number of $X_1 - \xi, \ldots, X_n - \xi$ that have a positive sign. If $\mathscr{H}_0$ is true, Z has a
binomial distribution with parameters n and $p = 1 - q = \int_{\xi}^{\infty} f(x)\,dx$. So if
$\mathscr{H}_0$ is true, one would expect Z to be near np, and hence an intuitively appealing
test is to accept $\mathscr{H}_0$ if and only if Z is near np. Since the distribution of Z is
known, one can determine what is meant by “near” by fixing the size of the test.
For example, suppose $q = \tfrac{1}{2}$ so that $\xi_q = \xi_{1/2}$ = median; then a possible test of
$\mathscr{H}_0: \xi_{1/2} = \xi$ versus $\mathscr{H}_1: \xi_{1/2} \ne \xi$ is to accept $\mathscr{H}_0$ if and only if
$|Z - np| = |Z - n/2| \le c$, where c is a constant determined by

\[ P[\,|Z - n/2| \le c\,] = 1 - \alpha, \]

where $\alpha$ is the desired size of the test. Now

\[ P[\,|Z - n/2| \le c\,] = \sum_{j = n/2 - c}^{n/2 + c} \binom{n}{j} \Bigl(\frac{1}{2}\Bigr)^n ; \]

so c can be determined from a binomial table. (For small sample sizes, not
many $\alpha$'s are possible, unless randomized tests are used.) The power function
of such a test can be readily obtained since the distribution of Z is still binomial
even when the null hypothesis is false; Z has the binomial distribution with
parameters n and $p = P[X > \xi]$. Such a power function could be sketched as a
function of p.

Note also that the sign test can be used to test one-sided hypotheses. For
instance, in testing $\mathscr{H}_0: \xi_q \le \xi$ versus $\mathscr{H}_1: \xi_q > \xi$, the sign test says to reject
$\mathscr{H}_0$ if and only if Z, defined as above, is large. Again the power function can
be easily obtained.
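The constant c of the two-sided sign test for the median can be found by direct summation instead of from a binomial table. The sketch below assumes the acceptance region $|Z - n/2| \le c$ (inclusive) and returns the smallest c whose size does not exceed the requested $\alpha$, together with the size actually attained; the function names and the sample are hypothetical.

```python
from math import comb


def sign_statistic(sample, xi):
    """Z = number of observations that exceed the hypothesized value xi."""
    return sum(1 for x in sample if x - xi > 0)


def acceptance_probability(n, c):
    """P[|Z - n/2| <= c] when Z is binomial with parameters n and 1/2."""
    return sum(comb(n, j) for j in range(n + 1) if abs(j - n / 2) <= c) / 2 ** n


def smallest_acceptance_constant(n, alpha):
    """Smallest c with size 1 - P[|Z - n/2| <= c] <= alpha, and that size."""
    c = 0.0 if n % 2 == 0 else 0.5
    while 1 - acceptance_probability(n, c) > alpha:
        c += 1
    return c, 1 - acceptance_probability(n, c)


c, size = smallest_acceptance_constant(n=10, alpha=0.05)
print(c, size)

# Hypothetical sample of size 10; test that the median is 5.0
sample = [4.1, 6.3, 5.2, 7.8, 3.9, 5.6, 6.1, 4.7, 5.9, 6.6]
z = sign_statistic(sample, 5.0)
print(z, "accept" if abs(z - len(sample) / 2) <= c else "reject")
```

For n = 10 the attained size is noticeably smaller than .05, which illustrates the parenthetical remark that few values of $\alpha$ are attainable without randomization.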


4 TOLERANCE LIMITS 

An automatic machine in a ball-bearing factory is supposed to manufacture 
bearings .25 inch in diameter. The bearings are regarded as acceptable from an 
engineering standpoint if the diameter falls between the limits .249 and .251 inch. 
Production is regularly checked each day by measuring the diameter of a random 
sample of bearings and computing statistical tolerance limits $L_1$ and $L_2$ from
their samples. If $L_1$ is above .249 and $L_2$ is below .251, the production is
accepted. How large should the sample be so that one can be assured with
90 percent probability that the statistical tolerance limits will contain at least 
80 percent of the population of bearing diameters? There is a simple non- 
parametric solution to problems of this kind. 

In more general terms, let $f(\cdot)$ be a probability density function, and on
the basis of a sample of n values it is desirable to determine two numbers, say
$L_1$ and $L_2$, such that at least .80, say, of the area under $f(\cdot)$ is between $L_1$ and
$L_2$. On the basis of a sample we cannot be certain that .80 of the area under
$f(\cdot)$ is between $L_1$ and $L_2$, but we can specify a probability that it is so.
In other words, we want to find two functions $L_1 = l_1(X_1, \ldots, X_n)$ and
$L_2 = l_2(X_1, \ldots, X_n)$ of the random sample $X_1, \ldots, X_n$ such that the probability
that

\[ \int_{L_1}^{L_2} f(x)\,dx \ge \beta \tag{11} \]

is equal to $\gamma$, for specified $\gamma$ and $\beta$. We summarize with a definition.

Definition 1 Tolerance limits Let $X_1, \ldots, X_n$ be a random sample from
continuous c.d.f. $F(\cdot)$ having a density function $f(\cdot)$. Let $L_1 =
l_1(X_1, \ldots, X_n) < L_2 = l_2(X_1, \ldots, X_n)$ be two statistics which satisfy:
(i) The distribution of $F(L_2) - F(L_1)$ does not depend on $F(\cdot)$.
(ii) $P[F(L_2) - F(L_1) \ge \beta] = \gamma$.

Then $L_1$ and $L_2$ will be defined to be $100\beta$ percent distribution-free tolerance
limits at probability level $\gamma$. ////

Remark Note that the random quantity $F(L_2) - F(L_1)$ represents the
area under $f(\cdot)$ between $L_1$ and $L_2$. ////

For continuous random variables, order statistics $Y_j$ and $Y_k$ ($j < k$) form
tolerance limits. To obtain the coefficients $\beta$ and $\gamma$ in the definition of tolerance
limits, we need the distribution of $F(L_2) - F(L_1)$. Recall that

\[
f_{Y_j, Y_k}(y_j, y_k) = \frac{n!}{(j-1)!\,(k-j-1)!\,(n-k)!}
 [F(y_j)]^{j-1} [F(y_k) - F(y_j)]^{k-j-1} [1 - F(y_k)]^{n-k} f(y_j) f(y_k)
\]

for $y_j < y_k$. Make the transformation $Z = F(Y_k) - F(Y_j)$ and $Y = F(Y_j)$, find the joint
distribution of Y and Z, and then integrate out y to get the marginal distribution of Z.
The following obtains:

\[ f_Z(z) = \frac{n!}{(k-j-1)!\,(n-k+j)!} \, z^{k-j-1} (1 - z)^{n-k+j} I_{(0,1)}(z), \tag{12} \]

which is a beta distribution with parameters $k - j$ and $n - k + j + 1$. Now

\[ P[Z < \beta] = \int_0^{\beta} f_Z(z)\,dz = IB_{\beta}(k - j, n - k + j + 1), \]

the incomplete beta function, which is tabled. Also, recall that

\[ IB_{\beta}(k - j, n - k + j + 1) = \sum_{i=k-j}^{n} \binom{n}{i} \beta^i (1 - \beta)^{n-i}. \]

Thus for any $\beta$, the probability level $\gamma$ can be computed.

EXAMPLE 3 For a random sample of size 5, use $(Y_1, Y_5)$ as a tolerance
interval for 75 percent of the population; that is, $\beta = .75$. What is the
corresponding probability level $\gamma$? We seek

\[
\begin{aligned}
\gamma &= P[F(Y_5) - F(Y_1) \ge .75] \\
 &= 1 - P[F(Y_5) - F(Y_1) < .75] \\
 &= 1 - \sum_{i=4}^{5} \binom{5}{i} (.75)^i (.25)^{5-i} = .3672. \qquad ////
\end{aligned}
\]

We might note that, in general,

\[ \mathscr{E}[Z] = \mathscr{E}[F(Y_k)] - \mathscr{E}[F(Y_j)] = \frac{k}{n+1} - \frac{j}{n+1} = \frac{k - j}{n + 1}. \tag{13} \]

For Example 3, $\mathscr{E}[Z] = \tfrac{4}{6} = \tfrac{2}{3}$.

EXAMPLE 4 Suppose that it is desired to determine how large a sample must
be taken so that the probability is .90 that at least 99 percent of a future
day’s output of bearings will have diameters between the largest and
smallest observations in the sample. The quantities are $\gamma = .90$ and
$\beta = .99$, and we want to determine n such that

\[ P[F(Y_n) - F(Y_1) \ge \beta] = \gamma, \]

where the density of $Z = F(Y_n) - F(Y_1)$ is given by Eq. (12) for $j = 1$ and
$k = n$. We get

\[ \gamma = P[Z \ge \beta] = \int_{\beta}^{1} n(n - 1) z^{n-2} (1 - z)\,dz = 1 - n\beta^{n-1} + (n - 1)\beta^n. \]

If we substitute for $\gamma$ and $\beta$, we get the equation

\[ .90 = 1 - n(.99)^{n-1} + (n - 1)(.99)^n, \]

which can be solved to determine n. The solution is $n \approx 388$. ////
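Both examples can be reproduced by summing binomial probabilities, using the identity between the incomplete beta function and the binomial tail recalled above. The sketch below is illustrative and its function names are hypothetical; for Example 4 it simply searches upward from n = 2 for the first sample size at which the probability level reaches the requested $\gamma$.

```python
from math import comb


def tolerance_level(n, j, k, beta):
    """gamma = P[F(Y_k) - F(Y_j) >= beta] for order statistics Y_j < Y_k.

    P[Z < beta] is the binomial tail sum_{i=k-j}^{n} C(n, i) beta^i (1 - beta)^(n - i).
    """
    below = sum(comb(n, i) * beta ** i * (1 - beta) ** (n - i)
                for i in range(k - j, n + 1))
    return 1 - below


def smallest_n_for_sample_range(beta, gamma):
    """Smallest n for which (Y_1, Y_n) covers at least beta with probability >= gamma."""
    n = 2
    while tolerance_level(n, 1, n, beta) < gamma:
        n += 1
    return n


print(tolerance_level(5, 1, 5, 0.75))            # Example 3
print(smallest_n_for_sample_range(0.99, 0.90))   # Example 4
```

The linear search is adequate here because the probability level increases with n; a root-finding routine applied to the closed form of Example 4 would serve equally well.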

There are similarities and differences between tolerance limits and con- 
fidence limits. Tolerance limits, like confidence limits, are two statistics, one 
less than the other, that together form an interval with random end points. The 
user of either interval is reasonably confident (the degree of confidence being 
measured by the corresponding confidence level) that the interval obtained 
contains what it is claimed to. This is where the similarity ends. A confidence 
interval is an interval thought to contain a fixed unknown parameter value. On
the other hand, a tolerance interval is an interval thought to contain a prescribed
proportion of the values of the random variable under consideration. In other
words, a confidence interval is an interval thought to contain an unknown fixed
parameter value that characterizes the distribution of population values, whereas
a tolerance interval is an interval thought to contain actual population values
and not some characteristic of them.


5 EQUALITY OF TWO DISTRIBUTIONS

5.1 Introduction

In this section various tests of the equality of two populations will be studied.
As we mentioned in Sec. 1 above, we first studied the equality of two populations
when we tested that the means from two normal populations were equal in
Subsec. 4.3 of Chap. IX. Then again in Subsec. 5.3 of Chap. IX we gave a test
of homogeneity of two populations. A great many nonparametric methods
have been developed for testing whether two populations have the same
distribution. We shall consider only four of them; a fifth will be briefly mentioned at
the end of this subsection.

The problem that we propose to consider is the following: Let $X_1, \ldots, X_m$
denote a random sample of size m from c.d.f. $F_X(\cdot)$ with a corresponding density
function $f_X(\cdot)$, and let $Y_1, \ldots, Y_n$ denote a random sample of size n from
c.d.f. $F_Y(\cdot)$ with a corresponding density function $f_Y(\cdot)$. (Note that we are
departing from our usual convention of using Y's to represent the order statistics
corresponding to the X's.) Further assume that the observations from $F_X(\cdot)$
are independent of the observations from $F_Y(\cdot)$. Test $\mathscr{H}_0: F_X(z) = F_Y(z)$ for
all z versus $\mathscr{H}_1: F_X(z) \ne F_Y(z)$ for at least one value of z. In Sec. 2 above we
pointed out that the sample c.d.f. can be used to estimate the population c.d.f.
In the case that $\mathscr{H}_0$ is true, that is, $F_X(z) = F_Y(z)$, we have two independent
estimators of the common population c.d.f., one using the sample c.d.f. of the
X's and the other using the sample c.d.f. of the Y's. Intuitively, then, one might
consider using the closeness of the two sample c.d.f.'s to each other as a test
criterion. Although we will not study it, a test called the two-sample
Kolmogorov-Smirnov test has been devised that uses such a criterion.

We will assume throughout that the random variables under consideration
are continuous and merely point out at this time that the methods to be presented
can be extended to include discrete random variables as well. In our
presentation we will consider testing two-sided hypotheses and will not consider
one-sided hypotheses, although the theory works equally well for one-sided
hypotheses.

5.2 Two-sample Sign Test

The first test that we will consider is the two-sample sign test. We shall see
that, in a certain sense, this test is a nonparametric analog of the paired t test
(see Prob. 17 in Chap. IX). For this test we assume that the sampling situation
is such that the X and Y observations are paired; that is, we observe $(X_1, Y_1),
\ldots, (X_n, Y_n)$. One could think that the X observation is “untreated” (a
control) and the corresponding Y observation is “treated,” the object of the
test being to determine if there is a “treatment” effect. We wish to test
$\mathscr{H}_0: F_X(\cdot) = F_Y(\cdot)$. Assume that $(X_1, Y_1), \ldots, (X_n, Y_n)$ is a random sample
from some joint distribution $F_{X,Y}(\cdot, \cdot)$. Further assume that $F_{X,Y}(\cdot, \cdot)$ is
such that $P[X > Y] = P[X < Y] = \tfrac{1}{2}$ when $\mathscr{H}_0$ is true. (Recall that we are
assuming continuous random variables, and then such an assumption is satisfied
if X and Y are independent.) Consider a test based on the signs of the
differences $X_i - Y_i$, $i = 1, \ldots, n$. For instance, define

\[ Z_i = I_{(0,\infty)}(X_i - Y_i); \]

then $Z_i$ has a Bernoulli distribution, and consequently $S_n = \sum_{i=1}^{n} Z_i$ has a binomial
distribution with parameters n and $p = P[X_i > Y_i]$. If $\mathscr{H}_0$ is true, $p = \tfrac{1}{2}$, and
$\mathscr{E}[S_n] = n/2$. If the alternative hypothesis is two-sided so that $p = P[X_i > Y_i]$
can be either larger or smaller than $\tfrac{1}{2}$, then a possible test criterion is to accept
$\mathscr{H}_0$ if $S_n$ is close to n/2, that is, accept $\mathscr{H}_0$ if $|S_n - n/2| < k$, where k is
determined by fixing the size of the test. k is easily determined from a binomial
table, and we have a very simple test of the equality of the two populations.
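The two-sample sign test can be sketched in a few lines. The paired observations below are hypothetical, as are the function names; the size is computed for the rule that accepts when $|S_n - n/2| < k$, so the rejection region is $|S_n - n/2| \ge k$.

```python
from math import comb


def paired_sign_statistic(pairs):
    """S_n = number of pairs (x, y) with x - y > 0."""
    return sum(1 for x, y in pairs if x - y > 0)


def size_of_sign_test(n, k):
    """P[|S_n - n/2| >= k] when S_n is binomial with parameters n and 1/2."""
    return sum(comb(n, s) for s in range(n + 1) if abs(s - n / 2) >= k) / 2 ** n


# Hypothetical "untreated" (x) and "treated" (y) measurements on 8 entities
pairs = [(5.1, 4.8), (3.2, 3.9), (6.0, 5.5), (4.4, 4.1),
         (7.3, 6.9), (5.0, 5.6), (6.1, 5.2), (4.9, 4.0)]

n = len(pairs)
s_n = paired_sign_statistic(pairs)
k = 3
print(s_n, size_of_sign_test(n, k))
print("accept" if abs(s_n - n / 2) < k else "reject")
```

Only the signs of the differences enter the computation, which is why no assumption about the joint distribution beyond $P[X > Y] = P[X < Y]$ under the null hypothesis is needed.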

One can see that avoidance of the assumption that $X_i$ and $Y_i$ are
independent is desirable. For example, $X_i$ might represent an observation on the
ith entity before some “treatment” and $Y_i$ the observation on the same entity
after “treatment.” In such a case one is not likely to have independence of
$X_i$ and $Y_i$ since they are observations taken on the same entity, yet one can
sometimes test that there is no “treatment” effect by testing that the “before”
and “after” populations are the same.


5.3 Run Test 

As before, let $X_1, \ldots, X_m$ denote a random sample from $F_X(\cdot)$ and $Y_1, \ldots, Y_n$
a random sample from $F_Y(\cdot)$. A rather simple test of $\mathscr{H}_0: F_X(z) = F_Y(z)$ for
all z is based on runs of values of X and values of Y. To understand the
meaning of runs, combine the m x observations with the n y observations and
then order (in ascending order of magnitude) the combined sample. For
example, if m = 4 and n = 5, one might obtain

\[ y \;\; x \;\; x \;\; y \;\; x \;\; y \;\; y \;\; y \;\; x \tag{14} \]

A run is a sequence of letters of the same kind bounded by letters of another
kind except for the first and last position. Thus, in Eq. (14) the ordering
starts with a run of one y value, then follows a run of two x values, then a run
of one y value, and so on; six runs are exhibited in Eq. (14). It is apparent that
if the two samples are from the same population, the x's and y's will ordinarily
be well mixed, and the total number of runs will be large. If the two populations
are widely separated so that their range of values does not overlap, then
the number of runs will be only two, and, in general, differences between the
two populations will tend to reduce the number of runs. Thus the two populations
may have the same mean or median, but if the x population is concentrated
while the y population is dispersed, there will be a tendency to have a long y
run on each end of the combined sample, and there will thus be a tendency
to reduce the number of runs. A test then is performed by observing the total
number of runs, say Z, in the combined sample and rejecting $\mathscr{H}_0$ if Z is less
than or equal to some specified number $z_0$. Our task now is to determine the
distribution of Z under $\mathscr{H}_0$ in order that for a given test size we may specify $z_0$.
If is true, it can be argued that the possible arrangements of the 
m x values and n y values arc equally likely It is clear that there are exactly 

( W * such arrangements To find P[Z <= z], it is necessary now to count 

all arrangements with exactly z runs Suppose z is even, say 2k, then there 
must be k runs of x values and k runs of y values To get k runs of x values, 
the m ads must be divided into k groups We can form these k groups, or runs, 
by inserting k - I dividers into the m - 1 spaces between the m x values with no 


more than one divider per space We can place the k — I dividers into the 
m - 1 spaces in ^ “ jj ways Similarly, we can construct the k runs of 

y values m ^ ~ J j w ays Any particular arrangement of the k runs of x values 

can be combined with any arrangement of the k runs of y values, furthermore, 
the first run in the combined arrangement can be either a run of x values or a 

run ofy values, hence there are a total of 2^ ” j j ^ “ jj arrangements having 
exactly z~2k runs Hence 


$$P[Z = z] = P[Z = 2k] = \frac{2\binom{m-1}{k-1}\binom{n-1}{k-1}}{\binom{m+n}{m}}. \tag{15}$$

Similarly, for $z$ odd,

$$P[Z = z] = P[Z = 2k + 1] = \frac{\binom{m-1}{k}\binom{n-1}{k-1} + \binom{m-1}{k-1}\binom{n-1}{k}}{\binom{m+n}{m}}. \tag{16}$$

To test $\mathscr{H}_0$ with size of Type I error equal to $\alpha$, one finds the integer $z_0$ so that
(as nearly as possible)

$$\sum_{z=2}^{z_0} P[Z = z] = \alpha \tag{17}$$

and rejects $\mathscr{H}_0$ if the observed value of $Z$ does not exceed $z_0$.

The computation involved in Eq. (17) can become quite tedious unless
both $m$ and $n$ are small. Fortunately, the distribution of $Z$ is approximately
normal for large samples, and in fact the approximation is usually good enough
for practical purposes when both $m$ and $n$ exceed 10. If $\mathscr{H}_0$ is true, the mean
and variance of $Z$ are

$$\mathscr{E}[Z] = \frac{2mn}{m + n} + 1 \tag{18}$$

and

$$\operatorname{var}[Z] = \frac{2mn(2mn - m - n)}{(m + n)^2 (m + n - 1)}. \tag{19}$$

The asymptotic normal distribution of $Z$ under $\mathscr{H}_0$ has mean and variance given
in Eqs. (18) and (19). This asymptotic normal distribution can be used to
determine the critical value $z_0$ for large samples.

The run test is sensitive to both differences in shape and differences in 
location between two distributions. 
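
The counting argument for the number of runs can be checked numerically. The following is a minimal sketch, assuming the null hypothesis holds and that m and n are both at least 1; the function names `run_pmf` and `critical_z0` are hypothetical. It evaluates the exact probabilities of an even number of runs (2k) and an odd number of runs (2k + 1) from the divider counts, and then picks the largest z_0 whose cumulative probability does not exceed a given size.

```python
from math import comb


def run_pmf(m, n):
    """Exact distribution of the total number of runs Z for m x's and n y's."""
    total = comb(m + n, m)
    pmf = {}
    for z in range(2, m + n + 1):
        k, odd = divmod(z, 2)
        if odd:
            # Either k + 1 runs of x and k runs of y, or the reverse.
            ways = (comb(m - 1, k) * comb(n - 1, k - 1)
                    + comb(m - 1, k - 1) * comb(n - 1, k))
        else:
            # k runs of each kind; the first run may be x or y.
            ways = 2 * comb(m - 1, k - 1) * comb(n - 1, k - 1)
        if ways:
            pmf[z] = ways / total
    return pmf


def critical_z0(m, n, alpha):
    """Largest z0 with P[Z <= z0] <= alpha; returns (z0, attained size)."""
    cumulative, z0 = 0.0, None
    for z, p in sorted(run_pmf(m, n).items()):
        if cumulative + p > alpha:
            break
        cumulative += p
        z0 = z
    return z0, cumulative
```

With m = 4 and n = 5 the probabilities sum to 1 over z = 2, ..., 9, the mean agrees with 2mn/(m + n) + 1, and for a size of .05 the sketch returns z_0 = 2, since including z = 3 would raise the size to 9/126.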


5.4 Median Test 

Let $X_1, \ldots, X_m$ be a random sample from $F_X(\cdot)$ and $Y_1, \ldots, Y_n$ be a random
sample from $F_Y(\cdot)$. As in the previous subsection, combine the two samples,
and order them. Let $Z_1 < Z_2 < \cdots < Z_{m+n}$ be the combined ordered sample.
The median test of $\mathscr{H}_0$: $F_X(u) = F_Y(u)$ for all $u$ consists of finding the median,
say $\tilde{z}$, of the $z$ values and then counting the number of $x$ values, say $m_1$, which
exceed $\tilde{z}$ and the number of $y$ values, say $n_1$, which exceed $\tilde{z}$. If $\mathscr{H}_0$ is true, $m_1$
should be approximately $m/2$ and $n_1$ approximately $n/2$. We can use either the
statistic $M_1$ or the statistic $N_1$ to construct the test. Let us use $M_1$ = number
of $X$'s which exceed $\tilde{Z}$, the median of the combined sample. If $m + n$ is even,
there are exactly $(m + n)/2$ of the observations (combined $x$'s and $y$'s) greater
than the median of the combined sample. (Since we have an even number of
continuous random variables, no two are equal, and the median is midway
between the middle two.) It can be easily argued that

$$P[M_1 = m_1] = \frac{\binom{m}{m_1}\binom{n}{(m+n)/2 - m_1}}{\binom{m+n}{(m+n)/2}}$$

for $m + n$ even and $\mathscr{H}_0$ true. A similar expression obtains for $m + n$ odd.
Such a distribution can be used to find a constant $k$ such that

$$P\left[\,\left|M_1 - \frac{m}{2}\right| \ge k\right] \approx \alpha,$$

and our test is given by the following:

Reject $\mathscr{H}_0$ if and only if $\left|M_1 - \frac{m}{2}\right| \ge k$.

Just as in the run test, an asymptotic normal distribution of $M_1$ can be
derived, but we will not study it.
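
For small samples the null distribution of the median-test statistic can be tabulated directly. The sketch below is illustrative only and assumes m + n is even, so that exactly (m + n)/2 observations lie above the combined median and the number of x values among them is hypergeometric; the function name `median_test_pmf` is hypothetical.

```python
from math import comb


def median_test_pmf(m, n):
    """P[M_1 = m1] under the null hypothesis, for m + n even.

    M_1 counts the x values among the (m + n)/2 observations that
    exceed the median of the combined sample.
    """
    if (m + n) % 2:
        raise ValueError("this sketch covers only the case m + n even")
    half = (m + n) // 2
    total = comb(m + n, half)
    return {
        m1: comb(m, m1) * comb(n, half - m1) / total
        for m1 in range(max(0, half - n), min(m, half) + 1)
    }
```

For m = n = 3 this gives probabilities 1/20, 9/20, 9/20, 1/20 for M_1 = 0, 1, 2, 3, from which a constant k for a chosen size can be read off.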


5.5 Rank-sum Test

A very interesting nonparametric test for two samples was described by
Wilcoxon and studied by Mann and Whitney. Given two random samples
$X_1, X_2, \ldots, X_m$ and $Y_1, \ldots, Y_n$ from populations with absolutely continuous
c.d.f.'s $F_X(\cdot)$ and $F_Y(\cdot)$, respectively, one arranges the $m + n$ observations in
ascending order and then replaces the smallest observation by 1, the next by 2,
and so on, the largest being replaced by $m + n$. These integers are called the
ranks of the observations. Let $T_x$ denote the sum of the ranks of the $m$ $x$ values
and $T_y$ the sum of the ranks of the $n$ $y$ values. Note that
$T_x + T_y = \sum_{j=1}^{m+n} j = (m + n + 1)(m + n)/2$, so $T_y$ is a linear function of $T_x$. We could base a test
on either statistic $T_x$ or $T_y$. Let us use $T_x$. $T_x$ is linearly related to another
statistic, which we denote by $U$. Set

$$U = \sum_{j=1}^{n}\sum_{i=1}^{m} I_{[Y_j, \infty)}(X_i), \tag{20}$$

the number of times an $X$ exceeds a $Y$. For a given set of observations, let
$r_1, r_2, \ldots, r_m$ denote the ranks of the $x$ values, and let $x'_1, \ldots, x'_m$ denote the
ordered $x$ values. Clearly $x'_1$ exceeds $(r_1 - 1)$ $y$-values, $x'_2$ exceeds $(r_2 - 2)$
$y$-values, and so on, and $x'_m$ exceeds $(r_m - m)$ $y$-values. Hence

$$u = \sum_{i=1}^{m}(r_i - i) = \sum_{i=1}^{m} r_i - \sum_{i=1}^{m} i = t_x - \frac{m(m+1)}{2},$$

or

$$U = T_x - \frac{m(m+1)}{2}. \tag{21}$$


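To find the first two moments of $T_x$, we find the first two moments of $U$.

$$\mathscr{E}[U] = \sum_{j=1}^{n}\sum_{i=1}^{m} \mathscr{E}[I_{[Y_j,\infty)}(X_i)] = \sum_{j=1}^{n}\sum_{i=1}^{m} P[X_i \ge Y_j] = mnp,$$

where

$$p = P[X_i \ge Y_j] = \int P[Y \le x \mid X = x]\, f_X(x)\, dx = \int F_Y(x) f_X(x)\, dx.$$

If $\mathscr{H}_0$ is true,

$$p = \int F_X(x) f_X(x)\, dx = \int_0^1 u\, du = \tfrac{1}{2}.$$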

Similarly, the variance of $U$ can be found. The derivation is somewhat more
complicated since one needs the expected value of $U^2$. From the mean and
variance of $U$, the mean and variance of $T_x$ can be obtained. If $\mathscr{H}_0$ is true,
they are given by

$$\mathscr{E}[T_x] = \frac{m(m + n + 1)}{2} \tag{22}$$

and

$$\operatorname{var}[T_x] = \frac{mn(m + n + 1)}{12}. \tag{23}$$

The exact distribution of $T_x$ turns out to be a very troublesome problem
for large $m$ and $n$. However, Mann and Whitney have calculated the distribution
for small $m$ and $n$, have shown that $T_x$ is approximately normally distributed
for large $m$ and $n$, and have demonstrated that the normal approximation is
quite accurate when $m$ and $n$ are larger than 7. Thus for samples of reasonable
size one can use the normal approximation with mean and variance given by
Eqs. (22) and (23) to find a critical region for testing $\mathscr{H}_0$: $F_X(z) = F_Y(z)$ for all $z$
versus $\mathscr{H}_1$: $F_X(z) \ne F_Y(z)$. The test would be the following:

Reject $\mathscr{H}_0$ if $|T_x - \mathscr{E}[T_x]|$ is large;

that is,

Reject $\mathscr{H}_0$ if and only if $|T_x - m(m + n + 1)/2| \ge k$,

where $k$ is determined by fixing the size of the test and using the asymptotic
normal distribution of $T_x$.


EXAMPLE 5 Find the exact distribution of $T_x$ under $\mathscr{H}_0$ for $m = 3$ and
$n = 2$. Each of the following arrangements is equally likely if $\mathscr{H}_0$ is
true:

xxxyy, xxyxy, xxyyx, xyxxy, xyxyx,
xyyxx, yxxxy, yxxyx, yxyxx, yyxxx

The corresponding $T_x$ values are, respectively, 6, 7, 8, 8, 9, 10, 9, 10, 11, 12,
so

$$P[T_x = 8] = P[T_x = 9] = P[T_x = 10] = \tfrac{2}{10}$$

and

$$P[T_x = 6] = P[T_x = 7] = P[T_x = 11] = P[T_x = 12] = \tfrac{1}{10}.$$ ////
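
The enumeration in Example 5 can be reproduced mechanically for any small m and n. The following sketch is an illustration, not the method used by Mann and Whitney for their tables; the function name `rank_sum_distribution` is hypothetical. It relies only on the fact that, under the null hypothesis, every choice of m ranks out of 1, ..., m + n for the x values is equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def rank_sum_distribution(m, n):
    """Exact null distribution of T_x, the sum of the ranks of the m x values."""
    counts = Counter(sum(ranks) for ranks in combinations(range(1, m + n + 1), m))
    total = sum(counts.values())
    return {t: Fraction(c, total) for t, c in sorted(counts.items())}


if __name__ == "__main__":
    for t, p in rank_sum_distribution(3, 2).items():
        print(t, p)
```

For m = 3 and n = 2 the ten equally likely rank sets give mean 9 and variance 3, in agreement with m(m + n + 1)/2 and mn(m + n + 1)/12.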

PROBLEMS 

/ Show that F- •• £ is *n unbiased estimator of fl^c £1 find vir 173. 
and show (hat T Is a mean-squared-error consistent estimator of FIX e J5J. 

2 Define FJ{D,) - - £ /,/JTJ for/- J. 2. find cov FXD t )l 

3 Let Yi, , Y, be the order statistics corresponding to a random sample of size is 
from a continuous cd f ft ) 

(а) Find the density of F\ Y t ) 

(б) Find the joint density of F(Fi) and F{Yft 

(e) Find the density of [F( K) - F{ >',)1[F( J'J - F{ r,)J. 

4 Let $X_1, \ldots, X_n$ be independent and identically distributed random variables
having common continuous c.d.f. $F(\cdot)$. Let $Y_1 < \cdots < Y_n$ be the corresponding
order statistics, and define $F_n(\cdot)$ to be the sample c.d.f. Set
$D_n = \sup_{-\infty < x < \infty} |F_n(x) - F(x)|$.

(a) Find the exact distribution of $D_n$ for $n = 1$.

(b) Do the same for it — 2. Iltvt Does/). -max tni*i) i-FfF.) EfE*)- 1. 

i~nr»))7 

(c) Argue that the exact distribution of $D_n$ will not depend on $F(\cdot)$.

5 Show that the expected value of the larger of a random sample of two observations
from a normal population with mean 0 and unit variance is $1/\sqrt{\pi}$ and hence that
for the general normal population the expected value is $\mu + \sigma/\sqrt{\pi}$.

6 If $(X, Y)$ is an observation from a bivariate normal population with means 0, unit
variances, and correlation $\rho$, show that the expected value of the larger of $X$ and $Y$
is $\sqrt{(1 - \rho)/\pi}$.

7 We have seen that the sample mean for a distribution with infinite variance (such 
as the Cauchy distribution) is not necessarily a consistent estimator of the popula- 
tion mean. Is the sample median a consistent estimator of the population median ? 

8 Construct a (approximate) 90 percent confidence band for the data of Example 1. 
Does your band include the appropriate uniform distribution? 

9 Let Y x < " * < Y s be the order statistics corresponding to a random sample from 
some continuous c.d.f. Compute P[Y, <£. 50 < Y,] and P[Y 2 <£ 50 <Y*1 
Compute P[ Yt < f 20 < Y 2 \ Compute P[Y 3 < f „ < Y s ). 

10 Let $Y_1$ and $Y_n$ be the first and last order statistics of a random sample of size $n$
from some continuous c.d.f. $F(\cdot)$. Find the smallest value of $n$ such that
$P[F(Y_n) - F(Y_1) \ge .75] \ge .90$.

11 Test as many ways as you know how at the 5 percent level that the following two 
samples came from the same population: 


x    1.3    1.4    1.4    1.5    1.7    1.9    1.9
y    1.6    1.8    2.0    2.1    2.1    2.2    2.3


12 Let $X_1, \ldots, X_5$ denote a random sample of size 5 from the density
$f(x; \theta) = I_{(\theta - 1/2,\ \theta + 1/2)}(x)$. Consider estimating $\theta$.

(a) Determine the confidence coefficient of the confidence interval $(Y_1, Y_5)$.

(b) Find a confidence interval for $\theta$ that has the same confidence coefficient as in
part (a) using the pivotal quantity $(Y_1 + Y_5)/2 - \theta$.

(c) Compare the expected lengths of the confidence intervals of parts (a) and (b).

13 Find $\operatorname{var}[U]$ when $F_X(\cdot) = F_Y(\cdot)$. See Eq. (20).

14 Equation (21) shows that $U$ and $T_x$ are linearly related. Find the exact distribution
of $U$ or $T_x$ when $\mathscr{H}_0$ is true for small sample sizes. For example, take $m = 1$,
$n = 2$; $m = 1$, $n = 3$; $m = 2$, $n = 1$; $m = 3$, $n = 1$; and $m = n = 2$.

15 We saw that $\mathscr{E}[U] = mnp$. Is $U/mn$ an unbiased estimator of $p = P[X_i \ge Y_j]$
whether or not $\mathscr{H}_0$ is true? Is $U/mn$ a consistent estimator of $p$?

16 A common measure of association for random variables $X$ and $Y$ is the rank
correlation, or Spearman's correlation. The $X$ values are ranked, and the observations
are replaced by their ranks; similarly the $Y$ observations are replaced by their
ranks. For example, for a sample of size 5 the observations

x    20.4    19.7    21.8    20.1    20.7
y     9.2     8.9    11.4     9.4    10.3

are replaced by

r(x)    3    1    5    2    4
r(y)    2    1    5    3    4

Let $r(X_i)$ denote the rank of $X_i$ and $r(Y_i)$ the rank of $Y_i$. Using these paired
ranks, the ordinary sample correlation is computed:

$$\text{Spearman's correlation} = S = \frac{\sum [r(X_i) - \bar{r}(X)][r(Y_i) - \bar{r}(Y)]}{\sqrt{\sum [r(X_i) - \bar{r}(X)]^2 \sum [r(Y_i) - \bar{r}(Y)]^2}},$$

where $\bar{r}(X) = \sum r(X_i)/n$ and $\bar{r}(Y) = \sum r(Y_i)/n$.

(a) Show that $S = 1 - 6 \sum D_i^2/(n^3 - n)$, where $D_i = r(X_i) - r(Y_i)$.

(b) Compute the ordinary correlation and Spearman's correlation for the above
data.
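
The ranking step and the two forms of Spearman's correlation in Prob. 16 can be sketched as follows. This is an illustration under the assumption that there are no ties (as is the case with probability 1 for continuous data); the helper names `ranks`, `pearson`, and `spearman` are hypothetical. The sketch computes S both as the ordinary correlation of the ranks and through the shortcut based on the rank differences, so the two can be compared.

```python
from math import sqrt


def ranks(values):
    """Rank 1 for the smallest value, up to n for the largest (no ties assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson(u, v):
    """Ordinary sample correlation of two equal-length sequences."""
    n = len(u)
    ubar, vbar = sum(u) / n, sum(v) / n
    num = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    den = sqrt(sum((a - ubar) ** 2 for a in u) * sum((b - vbar) ** 2 for b in v))
    return num / den


def spearman(xs, ys):
    """Spearman's correlation via 1 - 6 * sum(D_i^2) / (n^3 - n)."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n ** 3 - n)
```

Applied to the five pairs listed in the problem, `ranks` reproduces the rank rows shown there, and the two routes to S agree.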

17 Argue that the distribution of $S$ in Prob. 16 is independent of the form of the
distributions of $X$ and $Y$ provided that $X$ and $Y$ are continuous and independently
distributed random variables. Hence $S$ can be used as a test statistic in a
nonparametric test of the null hypothesis of independence.

18 Show that the mean and variance of $S$ (in Prob. 17) under the hypothesis of
independence are 0 and $1/(n - 1)$, respectively.
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1 INTRODUCTION 

The purpose of this appendix is to provide the reader with a ready reference to some
mathematical results that are used in the book. This appendix is divided into two
main sections: The first, Sec. 2 below, gives results that are, for the most part,
combinatorial in nature, and the last gives results from calculus. No attempt is made to
prove these results, although sometimes a method of proof is indicated.


2 NONCALCULUS 


2.1 Summation and Product Notation

A sum of terms such as $n_3 + n_4 + n_5 + n_6 + n_7$ is often designated by the symbol
$\sum_{i=3}^{7} n_i$. $\Sigma$ is the capital Greek letter sigma, and in this connection it is often called the
summation sign. The letter $i$ is called the summation index. The term following $\Sigma$ is
called the summand. The “$i = 3$” below $\Sigma$ indicates that the first term of the sum is
obtained by putting $i = 3$ in the summand. The “7” above the $\Sigma$ indicates that the
final term of the sum is obtained by putting $i = 7$ in the summand. The other terms
of the sum are obtained by giving $i$ the integral values between the limits 3 and 7. Thus

$$\sum_{i=2}^{5} (-1)^i\, i\, x^{2i} = 2x^4 - 3x^6 + 4x^8 - 5x^{10}.$$

An analogous notation for a product is obtained by substituting the capital
Greek letter $\Pi$ for $\Sigma$. In this case the terms resulting from substituting the integers
for the index are multiplied instead of added. Thus $\prod_{i=3}^{7} n_i = n_3 n_4 n_5 n_6 n_7$.

EXAMPLE 1 Some useful formulas involving summations are listed below.
They can be proved using mathematical induction.

$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2}. \tag{1}$$

$$\sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}. \tag{2}$$

$$\sum_{i=1}^{n} i^3 = \left[\frac{n(n+1)}{2}\right]^2. \tag{3}$$

$$\sum_{i=1}^{n} i^4 = \frac{n(n+1)(2n+1)(3n^2 + 3n - 1)}{30}. \tag{4}$$

Equation (1) can be used to derive the following formula for an arithmetic series
or progression:

$$\sum_{j=1}^{n} [a + (j - 1)d] = na + \frac{d}{2}\, n(n - 1). \tag{5}$$

A companion series, the finite geometric series, or progression, is given by

$$\sum_{j=0}^{n-1} a r^j = a\, \frac{1 - r^n}{1 - r}. \tag{6}$$

////

2.2 Factorial and Combinatorial Symbols and Conventions

A product of a positive integer $n$ by all the positive integers smaller than it is usually
denoted by $n!$ (read “$n$ factorial”). Thus

$$n! = n(n - 1)(n - 2) \cdots 1 = \prod_{j=0}^{n-1} (n - j). \tag{7}$$

$0!$ is defined to be 1.

A product of a positive integer $n$ by the next $k - 1$ smaller positive integers is
usually denoted by $(n)_k$. Thus

$$(n)_k = n(n - 1) \cdots (n - k + 1) = \prod_{j=1}^{k} (n - j + 1). \tag{8}$$

Note that there are $k$ terms in the product in Eq. (8).


Remark $(n)_k = n!/(n - k)!$, and $(n)_n = n!/0! = n!$. The combinatorial symbol
$\binom{n}{k}$ is defined as follows:

$$\binom{n}{k} = \frac{(n)_k}{k!} = \frac{n!}{(n - k)!\, k!}. \tag{9}$$

$\binom{n}{k}$ is read “combination of $n$ things taking $k$ at a time” or more briefly as
“$n$ pick $k$”; it is also called a binomial coefficient. Define
$\binom{n}{k} = 0$ if $k < 0$ or $k > n$.

Remark

$$\binom{n}{k} = \binom{n}{n - k}. \tag{10}$$

////

$$\binom{n+1}{k} = \binom{n}{k} + \binom{n}{k-1} \quad \text{for } n = 1, 2, \ldots \text{ and } k = 0, \pm 1, \pm 2, \ldots. \tag{11}$$

Equation (11) is a useful recurrent formula that is easily proved. ////


Both $(n)_k$ and the combinatorial symbol $\binom{n}{k}$ can be generalized from a positive integer
$n$ to any real number $t$ by defining

$$(t)_k = t(t - 1) \cdots (t - k + 1), \qquad \binom{t}{k} = \frac{t(t - 1) \cdots (t - k + 1)}{k!} \quad \text{for } k = 1, 2, \ldots, \tag{12}$$

and $\binom{t}{0} = 1$ for $k = 0$.

Remark

$$\binom{-n}{k} = \frac{(-n)(-n - 1) \cdots (-n - k + 1)}{k!} = (-1)^k\, \frac{n(n + 1) \cdots (n + k - 1)}{k!} = (-1)^k \binom{n + k - 1}{k}.$$

////

2.3 Stirling's Formula

In finding numerical values of probabilities, one is often confronted with the evaluation
of long factorial expressions which can be troublesome to compute by direct multiplication.
Much labor may be saved by using Stirling's formula, which gives an approximate
value of $n!$. Stirling's formula is

$$n! \approx (2\pi)^{1/2} e^{-n} n^{n + 1/2} \tag{13}$$

or

$$n! = (2\pi)^{1/2} e^{-n} n^{n + 1/2} e^{r(n)/12n}, \tag{14}$$

where $1 - 1/(12n + 1) < r(n) < 1$. To indicate the accuracy of Stirling's formula, $10!$
was evaluated using five-place logarithms and Eq. (13), and 3,599,000 was obtained.
The actual value of $10!$ is 3,628,800. The percent error is less than 1 percent, and the
percent error will decrease as $n$ increases.
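
The accuracy claim for Stirling's formula is easy to check numerically. The sketch below is illustrative; the function names `stirling` and `correction_factor` are hypothetical. It evaluates the approximation sqrt(2*pi) * exp(-n) * n**(n + 1/2), the relative error against the exact factorial, and the quantity r(n) obtained by solving n! = approximation * exp(r(n)/(12n)) for r(n).

```python
from math import exp, factorial, log, pi, sqrt


def stirling(n):
    """Stirling's approximation to n!."""
    return sqrt(2 * pi) * exp(-n) * n ** (n + 0.5)


def correction_factor(n):
    """r(n) such that n! = stirling(n) * exp(r(n) / (12 n))."""
    return 12 * n * log(factorial(n) / stirling(n))


if __name__ == "__main__":
    for n in (5, 10, 20):
        rel_err = (factorial(n) - stirling(n)) / factorial(n)
        print(n, round(stirling(n)), f"{100 * rel_err:.3f}%", correction_factor(n))
```

For n = 10 the approximation is roughly 3.5987 million against the exact 3,628,800, a relative error of about 0.8 percent, and the computed r(10) falls between 1 - 1/121 and 1.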


2.4 The Binomial and Multinomial Theorems
The binomial theorem is often given as

$$(a + b)^n = \sum_{j=0}^{n} \binom{n}{j} a^j b^{n-j} \tag{15}$$

for $n$, a positive integer. The binomial theorem explains why the $\binom{n}{j}$ are sometimes
called binomial coefficients. Four special cases are noted in the following remark.


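Remark

$$(1 + t)^n = \sum_{j=0}^{n} \binom{n}{j} t^j, \tag{16}$$

$$(1 - t)^n = \sum_{j=0}^{n} \binom{n}{j} (-1)^j t^j, \tag{17}$$

$$2^n = \sum_{j=0}^{n} \binom{n}{j}, \tag{18}$$

and

$$0 = \sum_{j=0}^{n} (-1)^j \binom{n}{j}. \tag{19}$$

////

Expanding both sides of

$$(1 + x)^a (1 + x)^b = (1 + x)^{a + b}$$

and then equating coefficients of $x$ to the $n$th power gives

$$\sum_{j=0}^{n} \binom{a}{j} \binom{b}{n - j} = \binom{a + b}{n}, \tag{20}$$

a formula that is particularly useful in considerations of the hypergeometric distribution.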
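A generalization of the binomial theorem is the multinomial theorem, which is

$$\left(\sum_{i=1}^{k} x_i\right)^n = \sum \frac{n!}{\prod_{i=1}^{k} n_i!} \prod_{i=1}^{k} x_i^{n_i}, \tag{21}$$

where the summation is over all nonnegative integers $n_1, n_2, \ldots, n_k$ which sum to $n$.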
A special case is 

Also note that

$$\left(\sum_{i=1}^{m} a_i\right)\left(\sum_{j=1}^{n} b_j\right) = \sum_{i=1}^{m} \sum_{j=1}^{n} a_i b_j. \tag{23}$$


3 CALCULUS 


3.1 Preliminaries 


It is assumed that the reader is familiar with the concepts of limits, continuity,
differentiation, integration, and infinite series. A particular limit that is referred to several
times in the book is the limit expression for the number $e$; that is,

$$\lim_{x \to 0} (1 + x)^{1/x} = e. \tag{24}$$

Equation (24) can be derived by taking logarithms and utilizing l'Hospital's rule,
which is reviewed below. There are a number of variations of Eq. (24), for instance,

$$\lim_{x \to \infty} (1 + x^{-1})^{x} = e \tag{25}$$

and

$$\lim_{x \to 0} (1 + \lambda x)^{1/x} = e^{\lambda} \quad \text{for constant } \lambda. \tag{26}$$


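A rule that is often useful in finding limits is the following so-called l'Hospital's
rule: If $f(\cdot)$ and $g(\cdot)$ are functions for which $\lim_{x \to a} f(x) = \lim_{x \to a} g(x) = 0$ and if

$$\lim_{x \to a} \frac{f'(x)}{g'(x)}$$

exists, then so does

$$\lim_{x \to a} \frac{f(x)}{g(x)},$$

and

$$\lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)}.$$

EXAMPLE 2 Find $\lim_{x \to 0} [(1/x) \log_e (1 + x)]$. Let $f(x) = \log_e (1 + x)$ and $g(x) = x$;
then

$$\lim_{x \to 0} \frac{1}{x} \log_e (1 + x) = \lim_{x \to 0} \frac{f'(x)}{g'(x)} = \lim_{x \to 0} \frac{1}{1 + x} = 1.$$ ////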

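Another rule that we use in the book is Leibniz' rule for differentiating an integral:
Let

$$I(t) = \int_{g(t)}^{h(t)} f(x; t)\, dx,$$

where $f(\cdot, \cdot)$, $g(\cdot)$, and $h(\cdot)$ are assumed differentiable. Then

$$\frac{dI}{dt} = \int_{g(t)}^{h(t)} \frac{\partial f}{\partial t}\, dx + f(h(t); t)\, \frac{dh}{dt} - f(g(t); t)\, \frac{dg}{dt}. \tag{27}$$

Several important special cases derive from Leibniz' rule; for example, if the
integrand $f(x; t)$ does not depend on $t$, then

$$\frac{d}{dt}\left[\int_{g(t)}^{h(t)} f(x)\, dx\right] = f(h(t))\, \frac{dh}{dt} - f(g(t))\, \frac{dg}{dt}; \tag{28}$$

in particular, if $g(t)$ is constant and $h(t) = t$, Eq. (28) simplifies to

$$\frac{d}{dt}\left[\int_{c}^{t} f(x)\, dx\right] = f(t). \tag{29}$$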
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3.2 Taylor Series 

The Taylor series for $f(x)$ about $x = a$ is defined as

$$f(x) = f(a) + f^{(1)}(a)(x - a) + \frac{f^{(2)}(a)(x - a)^2}{2!} + \cdots + \frac{f^{(n)}(a)(x - a)^n}{n!} + R_n, \tag{30}$$

where

$$f^{(i)}(a) = \left.\frac{d^i f(x)}{dx^i}\right|_{x = a},$$

$$R_n = \frac{f^{(n+1)}(c)(x - a)^{n+1}}{(n + 1)!},$$

and $a \le c \le x$.

$R_n$ is called the remainder. $f(x)$ is assumed to have derivatives of at least order $n + 1$.
If the remainder is not too large, Eq. (30) gives a polynomial (of degree $n$) approximation,
when $R_n$ is dropped, of the function $f(\cdot)$. The infinite series corresponding to
Eq. (30) will converge in some interval if $\lim_{n \to \infty} R_n = 0$ in this interval. Several important
infinite Taylor series, along with their intervals of convergence, are given in the following
examples.


EXAMPLE 3 Suppose $f(x) = e^x$ and $a = 0$. Then

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots = \sum_{j=0}^{\infty} \frac{x^j}{j!} \quad \text{for } -\infty < x < \infty. \tag{31}$$

////


EXAMPLE 4 Suppose $f(x) = (1 - x)^t$ and $a = 0$; then $f^{(1)}(x) = -t(1 - x)^{t-1}$,
$f^{(2)}(x) = t(t - 1)(1 - x)^{t-2}$, $\ldots$, $f^{(j)}(x) = (-1)^j t(t - 1) \cdots (t - j + 1)(1 - x)^{t-j}$,
and hence

$$f(x) = (1 - x)^t = \sum_{j=0}^{\infty} (-1)^j (t)_j \frac{x^j}{j!} = \sum_{j=0}^{\infty} \binom{t}{j} (-x)^j \quad \text{for } -1 < x < 1. \tag{32}$$

////

There are several interesting special cases of Eq. (32). $t = -n$ gives

$$(1 - x)^{-n} = \sum_{j=0}^{\infty} \binom{-n}{j} (-x)^j = \sum_{j=0}^{\infty} \binom{n + j - 1}{j} x^j \quad \text{for } -1 < x < 1; \tag{33}$$

$t = -1$ gives the geometric series

$$(1 - x)^{-1} = \sum_{j=0}^{\infty} x^j; \tag{34}$$

$t = -2$ gives

$$(1 - x)^{-2} = \sum_{j=0}^{\infty} (j + 1) x^j. \tag{35}$$


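EXAMPLE 5 Suppose $f(x) = \log_e (1 + x)$ and $a = 0$; then

$$\log_e (1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots \quad \text{for } -1 < x \le 1. \tag{36}$$

////

The Taylor series for functions of one variable given in Eq. (30) can be generalized
to the Taylor series for functions of several variables. For example, the Taylor series
for $f(x, y)$ about $x = a$ and $y = b$ can be written as

$$f(x, y) = f(a, b) + f_x(a, b)(x - a) + f_y(a, b)(y - b) + \frac{1}{2!}\left[f_{xx}(a, b)(x - a)^2 + 2 f_{xy}(a, b)(x - a)(y - b) + f_{yy}(a, b)(y - b)^2\right] + \cdots,$$

where

$$f_x(a, b) = \left.\frac{\partial f}{\partial x}\right|_{x = a,\, y = b}, \qquad f_{xy}(a, b) = \left.\frac{\partial^2 f}{\partial y\, \partial x}\right|_{x = a,\, y = b},$$

and similarly for the others.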

3.3 The Gamma and Beta Functions 

The gamma function, denoted by $\Gamma(\cdot)$, is defined by

$$\Gamma(t) = \int_0^{\infty} x^{t-1} e^{-x}\, dx \quad \text{for } t > 0. \tag{37}$$

$\Gamma(t)$ is nothing more than a notation for the definite integral that appears on the
right-hand side of Eq. (37). Integration by parts yields

$$\Gamma(t + 1) = t\, \Gamma(t), \tag{38}$$

and, hence, if $t = n$ (an integer),

$$\Gamma(n + 1) = n!. \tag{39}$$

If $n$ is an integer,

$$\Gamma(n + \tfrac{1}{2}) = \frac{1 \cdot 3 \cdot 5 \cdots (2n - 1)}{2^n} \sqrt{\pi}, \tag{40}$$

and, in particular,

$$\Gamma(\tfrac{1}{2}) = 2\, \Gamma(\tfrac{3}{2}) = \sqrt{\pi}. \tag{41}$$

The beta function, denoted by $B(\cdot, \cdot)$, is defined by

$$B(a, b) = \int_0^1 x^{a-1} (1 - x)^{b-1}\, dx \quad \text{for } a > 0,\ b > 0. \tag{42}$$

Again, $B(a, b)$ is just a notation for the definite integral that appears on the right-hand
side of Eq. (42). A simple variable substitution gives $B(a, b) = B(b, a)$. The beta
function is related to the gamma function according to the following formula:

$$B(a, b) = \frac{\Gamma(a)\, \Gamma(b)}{\Gamma(a + b)}. \tag{43}$$
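
The gamma and beta relations listed in this subsection can be spot-checked with the standard library. The sketch below is only a numerical illustration: `math.gamma` is the library's gamma function, while `beta` and `beta_by_midpoint_rule` are hypothetical helper names, the latter a crude midpoint-rule approximation of the defining integral of the beta function on (0, 1).

```python
from math import factorial, gamma, pi, sqrt


def beta(a, b):
    """Beta function through its relation to the gamma function."""
    return gamma(a) * gamma(b) / gamma(a + b)


def beta_by_midpoint_rule(a, b, steps=20000):
    """Midpoint-rule approximation of the integral of x^(a-1) (1-x)^(b-1) on (0, 1)."""
    h = 1.0 / steps
    return h * sum(
        ((i + 0.5) * h) ** (a - 1) * (1 - (i + 0.5) * h) ** (b - 1)
        for i in range(steps)
    )


if __name__ == "__main__":
    print(gamma(5), factorial(4))                  # Gamma(n + 1) = n!
    print(gamma(0.5), sqrt(pi))                    # Gamma(1/2) = sqrt(pi)
    print(gamma(3.5), 1 * 3 * 5 * sqrt(pi) / 8)    # Gamma(n + 1/2) with n = 3
    print(beta(2, 3), beta_by_midpoint_rule(2, 3))
```

The midpoint rule is chosen only because it avoids evaluating the integrand at the end points; it is adequate here for a, b >= 1 and is not meant as a general-purpose integrator.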



APPENDIX B 

TABULAR SUMMARY OF PARAMETRIC FAMILIES 

OF DISTRIBUTIONS 


1 INTRODUCTION 

The purpose of this appendix is to provide the reader with a convenient reference 
to the parametric families of distributions that were introduced in Chap. III. Given 
are two tables, one for discrete distributions and the other for continuous distribu- 
tions. 
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Table 1 DISCRETE DISTRIBUTION'S 


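Name of parametric      Discrete density function f(·)                                  Parameter space                  Mean μ = E[X]
family of distributions

Discrete uniform        $f(x) = \frac{1}{N} I_{\{1, \ldots, N\}}(x)$                     $N = 1, 2, \ldots$               $\frac{N + 1}{2}$

Bernoulli               $f(x) = p^x q^{1-x} I_{\{0, 1\}}(x)$                             $0 \le p \le 1$ $(q = 1 - p)$    $p$

Binomial                $f(x) = \binom{n}{x} p^x q^{n-x} I_{\{0, 1, \ldots, n\}}(x)$     $0 \le p \le 1$                  $np$
                                                                                        $n = 1, 2, 3, \ldots$
                                                                                        $(q = 1 - p)$

Hypergeometric          $f(x) = \frac{\binom{K}{x}\binom{M - K}{n - x}}{\binom{M}{n}} I_{\{0, 1, \ldots, n\}}(x)$   $M = 1, 2, \ldots$   $n \frac{K}{M}$
                                                                                        $K = 0, 1, \ldots, M$
                                                                                        $n = 1, 2, \ldots, M$

Poisson                 $f(x) = \frac{e^{-\lambda} \lambda^x}{x!} I_{\{0, 1, \ldots\}}(x)$   $\lambda > 0$                $\lambda$

Geometric               $f(x) = p q^x I_{\{0, 1, \ldots\}}(x)$                           $0 < p \le 1$ $(q = 1 - p)$      $\frac{q}{p}$

Negative binomial       $f(x) = \binom{r + x - 1}{x} p^r q^x I_{\{0, 1, \ldots\}}(x)$    $0 < p \le 1$, $r > 0$           $\frac{rq}{p}$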
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Table 2 CONTINUOUS DISTRIBUTIONS 


Name of 
parametric 
family of 
distributions 

Cumulative distribution function F ( ) Parameter Mean 

or probability density function /() Space }i~£[X] 

Uniform or 
rectangular 

/W-jijJfc.nM — co<<j<6<co 

Normal 

/w = vfc MPl_Cx -^' 2 ^ f* 

Exponential 

/{*) = AiU"/(. ->(*) A > 0 j- 

Gamma 


Beta 

6>o° rfi 

Cauchy 

' » « J — co<«<<» Does not 

/W *r/Sll + tt*-«)/W) P>0 exist 

Lognormal 

/W- -«<ft<co exp^+id 1 ! 

1 

j (V2wo e *Pt-0°&*~fOV2»*J/<« ®)W 

Double 

exponential 

j = 2£ £V <w * 
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Variance 

a 1 = <?[(*- M ) 2 ] 

Moments /x' — <?[X r ] 
or fi, = if [(A"— ft) r ] 
and/or cumulants k. 

Moment generating 
function S[c tx ] 

(b -aY 

fir ~ 0 for r odd 

e"‘ - e“ 

12 

*~2 ' ( r +1) f0rrCVCQ 

(6 — a)t 

a 2 

& ~ 0- r odd; /t, = ^ 2 „ 2 , r even; 
k, = 0, r >2 

exp[fi/ + i<r J t z ] 

1 

A 1 

„-- r(r+1 > 

A r 

A r x 

r for/<A 

A- / 

r 

A 2 

u'- r(r+y) 

N A J r(r) 

fe) r for ' <A 

ab 

B(r + a f b ) 

not useful 

(a + b + \)(a + b) 1 

Mr= B(o, 6) 

Docs not exist 

Do not exist 

Characteristic function 
is 

exp[2/x -f 2 a 2 ] 

— exp[2/x-f 2a 2 ] 

fir = cxp[r/x 4- i r 2 cr 2 ] 

not useful 

2/? 2 

fir — 0 for r odd; 
fi r = r\ for r even 

i — (fiO 1 


(continued)







Table 2 CONTINUOUS DISTRIBUTIONS (continued)


Name of 
parametric 
family of 
distributions 

Cumulative distribution function f{) 
or probability density function /X ) 

Parameter 

Mean 

p=*r*i 

Weibull 

/M - abx*~ 'exp[-«‘]/( 0 .,(*) 

O2»0 

i ^0 

a- l/ ‘r(l+4-i) 

Logistic 


-00< *< 00 
p> 0 

« 

Pareto 

Bxl 

/to 

>0 

0 > 0 

0*0 

0-1 

for 5 > i 

Gumbel or 
extreme value 

FCx)~ exp (-«-<*-«») 

“«< «< 00 
/S > 0 

« + 0y. 
y~ 577216 

t distribution 

?l{k + \)P] 1 1 


pH 

Tlk/2) Vkir (l+AW* tl, ' a 

fc>0 

for k> 1 

F distribution 

/(*) - r ^ m + 

m,n- 1,2, 

n — 2 

for ji > 2 

Chi square 
distribution 


*=1,2. 

fc 
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Variance 

cr 2 = <?[(*- M ) 2 ] 

Moments ft' r — 
or /*r = <?[(*- ^) p ] 
and/or cumulants K r 

Moment generating 
function 

a-«*[ra+2fc-‘) 

-r j a+6-‘)] 



P 2 7T 2 

3 



€ at TTp( CSC(7Tj8/) 

ON 

, 0-*S 

for 9 > r 

does not 
exist 

(0 -me- 2 ) 

for 0 > 2 

P'-B-r 

7T 2 j3 2 

k, = (— p)'4^'~ n (l) for r > 2, 

e*T(l-pt) 

6 

where ^r(0 is digamma function 

for t < 1//5 


— 0 for k> r and r odd 


k 

A" 2 B((/-+ l)/2, (k-r)l2) 

does not 

k—2 

f*r~ 

B(|, */2) 

exist 

for k > 2 

for k> r and r even 


2n 2 (m + « - 2) 

. (n\'T(ml2+r)T(nl2-r) 

does not 

m(/i - 2) 2 (n - 4) 


r(m/2)r(n/2) 

exist 

for /i > 4 

„ n 

for r < - 




, vrm+j) 

i i v ' 2 

2k 

r(*/2) 

\1 - 2// 




for r< 1/2 
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TABLES 


1 DESCRIPTION OF TABLES 


Table 1 Ordinates of the Normal Density Function
This table gives values of

$$\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$

for values of $x$ between 0 and 4 at intervals of .01. For negative values of $x$ one uses
the fact that $\phi(-x) = \phi(x)$.

Table 2 Cumulative Normal Distribution
This table gives values of

$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt = \int_{-\infty}^{x} \phi(t)\, dt$$

for values of $x$ between 0 and 3.5 at intervals of .01. For negative values of $x$, one uses
the relation $\Phi(-x) = 1 - \Phi(x)$. Values of $x$ corresponding to a few special values of $\Phi$
are given separately beneath the main table.


Table 3 Cumulative Chi-Square Distribution

This table gives values of $u$ corresponding to a few selected values of $F(u)$, where

$$F(u) = \int_0^u \frac{x^{(n-2)/2}\, e^{-x/2}}{2^{n/2}\, \Gamma(n/2)}\, dx$$

for $n$, the number of degrees of freedom, equal to 1, 2, ..., 30. For larger values of $n$,
a normal approximation is quite accurate. The quantity $\sqrt{2u} - \sqrt{2n - 1}$ is nearly
normally distributed with mean 0 and unit variance. Thus $u_\alpha$, the $\alpha$th quantile point
of the distribution, may be computed by

$$u_\alpha = \tfrac{1}{2}\left(z_\alpha + \sqrt{2n - 1}\right)^2,$$

where $z_\alpha$ is the $\alpha$th quantile point of the cumulative normal distribution. As an
illustration, we may compute the .95 value of $u$ for $n = 30$ degrees of freedom:

$$u_{.95} = \tfrac{1}{2}\left(1.645 + \sqrt{59}\right)^2 = 43.5,$$

which is in error by less than 1 percent.
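
The normal approximation for chi-square quantiles described above amounts to one line of arithmetic. The following sketch reproduces the illustration for 30 degrees of freedom; the function name `chi2_quantile_approx` is hypothetical, and the normal quantile is taken from `statistics.NormalDist` (Python 3.8 or later) instead of a printed table.

```python
from math import sqrt
from statistics import NormalDist


def chi2_quantile_approx(alpha, n):
    """Approximate alpha-th quantile of chi-square with n degrees of freedom.

    Uses the fact that sqrt(2u) - sqrt(2n - 1) is nearly standard normal,
    so u is about (z_alpha + sqrt(2n - 1))**2 / 2.
    """
    z = NormalDist().inv_cdf(alpha)
    return 0.5 * (z + sqrt(2 * n - 1)) ** 2


if __name__ == "__main__":
    print(round(chi2_quantile_approx(0.95, 30), 1))
```

The approximation is intended for degrees of freedom beyond the tabulated range; for small n the exact table values should be preferred.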


Table 4 Cumulative F Distribution

This table gives values of $F$ corresponding to five values of

$$G(F) = \int_0^F \frac{\Gamma[(m + n)/2]}{\Gamma(m/2)\, \Gamma(n/2)} \left(\frac{m}{n}\right)^{m/2} \frac{x^{(m-2)/2}}{[1 + (m/n)x]^{(m+n)/2}}\, dx$$

for selected values of $m$ and $n$; $m$ is the number of degrees of freedom in the numerator
of $F$, and $n$ is the number of degrees of freedom in the denominator of $F$. The table
also provides values corresponding to $G$ = .10, .05, .025, .01, and .005 because $F_{1-\alpha}$
for $m$ and $n$ degrees of freedom is the reciprocal of $F_\alpha$ for $n$ and $m$ degrees of freedom.
Thus for $G$ = .05 with three and six degrees of freedom, one finds

$$F_{.05}(3, 6) = \frac{1}{F_{.95}(6, 3)} = \frac{1}{8.94} = .112.$$

One should interpolate on the reciprocals of $m$ and $n$ as in Table 5 for good accuracy.
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Table 5 Cumulative Students t Distribution 

This table gives values of $t$ corresponding to a few selected values of

$$F(t) = \int_{-\infty}^{t} \frac{\Gamma[(n + 1)/2]}{\sqrt{\pi n}\, \Gamma(n/2)} \left(1 + \frac{x^2}{n}\right)^{-(n+1)/2} dx$$

with $n$ = 1, 2, ..., 30, 40, 60, 120, $\infty$. Since the density is symmetrical in $t$, it follows
that $F(-t) = 1 - F(t)$. One should not interpolate linearly between degrees of
freedom but on the reciprocal of the degrees of freedom, if good accuracy in the last
digit is desired. As an illustration, we shall compute the .975th quantile point for
40 degrees of freedom. The values for 30 and 60 are 2.042 and 2.000. Using the
reciprocals of $n$, the interpolated value is

$$2.042 - \frac{\frac{1}{30} - \frac{1}{40}}{\frac{1}{30} - \frac{1}{60}}\, (2.042 - 2.000) = 2.021,$$

which is the correct value. Interpolating linearly, one would have obtained 2.028.
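
The interpolation rule described above can be written as a small helper. This is a sketch with hypothetical function names (`interpolate_on_reciprocal`, `interpolate_linearly`); it takes two tabulated quantiles at degrees of freedom n_lo < n_hi and interpolates either on 1/n or on n itself, so the two answers can be compared.

```python
def interpolate_on_reciprocal(n, n_lo, t_lo, n_hi, t_hi):
    """Interpolate a tabulated quantile on the reciprocal of the degrees of freedom."""
    weight = (1 / n_lo - 1 / n) / (1 / n_lo - 1 / n_hi)
    return t_lo - weight * (t_lo - t_hi)


def interpolate_linearly(n, n_lo, t_lo, n_hi, t_hi):
    """Ordinary linear interpolation in the degrees of freedom, for comparison."""
    weight = (n - n_lo) / (n_hi - n_lo)
    return t_lo - weight * (t_lo - t_hi)


if __name__ == "__main__":
    print(round(interpolate_on_reciprocal(40, 30, 2.042, 60, 2.000), 3))
    print(round(interpolate_linearly(40, 30, 2.042, 60, 2.000), 3))
```

With the tabulated values 2.042 (30 degrees of freedom) and 2.000 (60 degrees of freedom), interpolation on the reciprocal gives 2.021 for 40 degrees of freedom, while plain linear interpolation gives 2.028.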
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Tabic 1 ORDINATES OF THE NORMAL DENSITY FUNCTION 

$$\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$

  x     .00     .01     .02     .03     .04     .05     .06     .07     .08     .09

  .0   .3989   .3989   .3989   .3988   .3986   .3984   .3982   .3980   .3977   .3973
  .1   .3970   .3965   .3961   .3956   .3951   .3945   .3939   .3932   .3925   .3918
  .2   .3910   .3902   .3894   .3885   .3876   .3867   .3857   .3847   .3836   .3825
  .3   .3814   .3802   .3790   .3778   .3765   .3752   .3739   .3725   .3712   .3697
  .4   .3683   .3668   .3653   .3637   .3621   .3605   .3589   .3572   .3555   .3538
  .5   .3521   .3503   .3485   .3467   .3448   .3429   .3410   .3391   .3372   .3352
  .6   .3332   .3312   .3292   .3271   .3251   .3230   .3209   .3187   .3166   .3144
  .7   .3123   .3101   .3079   .3056   .3034   .3011   .2989   .2966   .2943   .2920
  .8   .2897   .2874   .2850   .2827   .2803   .2780   .2756   .2732   .2709   .2685
  .9   .2661   .2637   .2613   .2589   .2565   .2541   .2516   .2492   .2468   .2444
 1.0   .2420   .2396   .2371   .2347   .2323   .2299   .2275   .2251   .2227   .2203
 1.1   .2179   .2155   .2131   .2107   .2083   .2059   .2036   .2012   .1989   .1965
 1.2   .1942   .1919   .1895   .1872   .1849   .1826   .1804   .1781   .1758   .1736
 1.3   .1714   .1691   .1669   .1647   .1626   .1604   .1582   .1561   .1539   .1518
 1.4   .1497   .1476   .1456   .1435   .1415   .1394   .1374   .1354   .1334   .1315
 1.5   .1295   .1276   .1257   .1238   .1219   .1200   .1182   .1163   .1145   .1127
 1.6   .1109   .1092   .1074   .1057   .1040   .1023   .1006   .0989   .0973   .0957
 1.7   .0940   .0925   .0909   .0893   .0878   .0863   .0848   .0833   .0818   .0804
 1.8   .0790   .0775   .0761   .0748   .0734   .0721   .0707   .0694   .0681   .0669
 1.9   .0656   .0644   .0632   .0620   .0608   .0596   .0584   .0573   .0562   .0551
 2.0   .0540   .0529   .0519   .0508   .0498   .0488   .0478   .0468   .0459   .0449
 2.1   .0440   .0431   .0422   .0413   .0404   .0396   .0387   .0379   .0371   .0363
 2.2   .0355   .0347   .0339   .0332   .0325   .0317   .0310   .0303   .0297   .0290
 2.3   .0283   .0277   .0270   .0264   .0258   .0252   .0246   .0241   .0235   .0229
 2.4   .0224   .0219   .0213   .0208   .0203   .0198   .0194   .0189   .0184   .0180
 2.5   .0175   .0171   .0167   .0163   .0158   .0154   .0151   .0147   .0143   .0139
 2.6   .0136   .0132   .0129   .0126   .0122   .0119   .0116   .0113   .0110   .0107
 2.7   .0104   .0101   .0099   .0096   .0093   .0091   .0088   .0086   .0084   .0081
 2.8   .0079   .0077   .0075   .0073   .0071   .0069   .0067   .0065   .0063   .0061
 2.9   .0060   .0058   .0056   .0055   .0053   .0051   .0050   .0048   .0047   .0046
 3.0   .0044   .0043   .0042   .0040   .0039   .0038   .0037   .0036   .0035   .0034
 3.1   .0033   .0032   .0031   .0030   .0029   .0028   .0027   .0026   .0025   .0025
 3.2   .0024   .0023   .0022   .0022   .0021   .0020   .0020   .0019   .0018   .0018
 3.3   .0017   .0017   .0016   .0016   .0015   .0015   .0014   .0014   .0013   .0013
 3.4   .0012   .0012   .0012   .0011   .0011   .0010   .0010   .0010   .0009   .0009
 3.5   .0009   .0008   .0008   .0008   .0008   .0007   .0007   .0007   .0007   .0006
 3.6   .0006   .0006   .0006   .0005   .0005   .0005   .0005   .0005   .0005   .0004
 3.7   .0004
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•This tabic is abridged from “Tables of percentage points of the incomplete beta function and of the chi-sqilarc distribution,” Biomclrika % 
Vol, 32 (1941). It is here published with the kind permission of its author, Catherine M, Thompson, and the editor of Biometrika. 
* This table is abridged from “Tables of percentage points of the inverted beta distribution,” Biometrika, Vol. 33 (1943). It is here published
with the kind permission of its authors, Maxine Merrington and Catherine M. Thompson, and the editor of Biometrika.
Absolutely continuous, 60, 61, 63, 64
Admissible estimator, 299 
Algebra of sets, 18, 22 
Analysis of variance, 437 
A posteriori probability, 5, 9 
A priori probability, 2-4, 9 
Arithmetic series, 528

Asymptotic distribution, 196, 256-258, 261, 359, 
440, 444 

Average sample size in sequential tests, 470-472

BAN estimators, 294, 296, 349, 446 
Bayes estimation, 339, 344 
Bayes’ formula, 36
Bayes risk, 344
Bayes test, 417

Bayesian interval estimates, 396, 397 
Bernoulli distribution, 87, 538 
Bernoulli trial, 88 
repeated independent, 89, 101, 137 
Best linear unbiased estimators, 499 
Beta distribution, 115, 540
of second kind, 215 
Beta function, 534, 535 
Bias, 293 

Binomial coefficient, 529 
Binomial distribution, 87-89, 119, 120, 538 
confidence limits for p, 393, 395
normal approximation, 120 
Poisson approximation, 119 
Binomial theorem, 530 
Birthday problem, 45 
Bivariate normal distribution, 162-168 
conditional distribution, 167 
marginal distribution, 167
moment generating function, 164 
moments, 165 
Boole’s inequality, 25 


Cauchy distribution, 117, 207, 238, 540
Cauchy-Schwarz inequalities, 162 
Censored, 104 

Central-limit theorem, 111, 120, 195, 233, 234

Centroid, 65 
Chebyshev inequality, 71 
multivariate, 172 

Chi-square distribution, 241, 542, 549 
table of, 553 

Chi-square tests, 440, 442-461 
contingency tables, 452-461 
goodness-of-fit, 442, 447 
Combinations and permutations, 528 
Combinatorial symbol, 528 
Complement of set, 10 


Complete families of densities, 321, 324, 354 
Complete statistic, 324 
Completeness, 321, 354 
Composite hypothesis, 418 
(See also Hypotheses) 

Concentration, 289 
Conditional distributions, 129, 148 
bivariate normal, 167 
continuous, 146, 147
discrete, 143-145 
Conditional expectation, 157 
Conditional mean, 158 
Conditional probability, 32 
Conditional variance, 159 
Confidence bands for c.d.f., 511 
Confidence coefficient, 375, 377, 461 
Confidence intervals, 373, 375, 377, 461 
c.d.f., 511

difference in means, 386 
general method for, 389 
large sample, 393 

mean of normal population, 375, 381, 384 
median, 512 

method of finding tests, 425, 461 
one-sided, 378 

p of binomial population, 393, 395 
pivotal method of obtaining, 379, 387 
regression coefficients, 491-494 
uniformly most accurate, 464 
variance of normal population, 382, 384 
Confidence limits [see Confidence interval(s)]
Confidence region, 377 
for mean and variance of normal population, 
384 

Confidence sets, 461 
uniformly most accurate, 464
Consistency of an estimator, 291, 294, 295, 359
Contagious distribution, 102, 122, 123 
Contingency tables, 452-461 
interaction, 454 
tests for independence, 452 
Continuous distributions, 60, 62 
(See also Distributions)

Continuous random variable, 60 
Convex function, 72 
Convolution, 186 
Correlation, 155, 161 
sample, 526 

Spearman’s rank, 525, 526 
Correlation coefficient, 155, 156 
Covariance, 155, 156 
of two linear combinations of random 
variables, 179 

Covariance matrix, 352, 489 
Cramer-Rao inequality, 316
Cramer-Rao lower bound, 316, 320 
Critical function, 404 
Critical region, 403
size of, 407

Cumulant generating function, 80

Cumulants, 80

Cumulative distribution function(s), 54, 56, 63,
130, 132, 144 
bivariate, 132 
decomposition of, 63
empirical, 264, 206 
joint, 130 

sample, 264, 287, 506, 511
unidimensional, 24, 56


Decision theory, 297, 343, 415
Deductive inference, 220 
Degenerate distribution, 258 
Degrees of freedom, 242, 246
De Moivre-Laplace limit theorem, 120
De Morgan’s law, 13, 14
Density functions, 62 
discrete, 57, 135
joint discrete, 133 
joint probability, 138 
probability, 60, 62
(See also Distributions)

Difference between means 
confidence intervals, 386 
tests, 432 

Differentiation, 532 
Discrete distributions, 57, 133, 144, 145 
(See also Distributions) 

Discrete random variables, 57, 63, 133 
Disjoint, 14 
Distance function, 287 

Distribution free tests, 505, 509, 514, 518-524 
Distribution functions, 54, 56, 63, 130, 132
of difference, 185 
of maximum, 182 
of minimum, 182
of order statistics, 251
of product, 187 
of quotient, 187 
of sums, 185 
Distributions:

asymptotic, 196, 226-264, 359, 440, 444

Bernoulli, 87, 236, 538 

beta, 115, 540 

beta binomial, 104 

binomial (see Binomial distribution)

bivariate normal, 162-168

Cauchy, 117, 207, 238, 540

chi-square, 241, 542, 549

conditional (see Conditional distributions) 

contagious, 102, 122, 123

continuous, 60, 62 

cumulative [see Cumulative distribution
function(s)]


Distributions:
degenerate, 258 
discrete, 58, 60 
discrete uniform, 86, 538
double exponential, 117, 540 
exponential, 111, 121, 237, 262, 540
F, 246, 247, 542, 549, 554
gamma, 111, 123, 540
geometric, 99, 538 
Gumbel, 118, 542 
hypergeometric, 91, 538
joint, 130, 133, 138
lambda, 128
Laplace, 117 

limiting, 196, 258, 261, 444, 446, 507-509
linear function of normal variates, 194 
logarithmic, 105 
logistic, 118, 540
lognormal, 117, 540 
marginal, 132, 135, 141 
Maxwell, 127 
multinomial, 137 
covariance for, 196 
multivariate, 129-174
negative binomial, 99, 538
negative hypergeometric, 213
normal (see Normal distribution)
order statistics, 251, 254
Pareto, 118, 542
Pearsonian system, 118-119
Poisson, 93, 104, 119-121, 123, 236, 538
prior, 340, 417
i distribution, 127 
Rayleigh, 127 
rectangular, 105, 238, 540
sample, 224 

Student's t, 249, 250, 542, 556
symmetric, 170 
table of, 538-543 
truncated, 122 

Tukey's symmetrical lambda, 128 
uniform, 105, 238, 540
continuous, 105, 540
discrete, 86, 538
variance ratio, 246, 437, 438
Weibull, 117, 542
(See also Sampling distributions)


Efficiency, 291 

Ellipsoid of concentration, 353
Empty set, 10 
Equivalent sets, 10 
Error:

mean-squared, 291
size of, 405
Type I, 405
Type II, 405
Estimation:
interval, 372 
point, 271 

Estimator(s), 272, 273 
admissible, 299 
BAN, 294, 296
Bayes method, 286, 339, 344 
best linear unbiased, 499 
better, 299 
closeness, 288 
concentrated, 289 
consistent, 294, 295, 359 
ellipsoid of concentration, 353 
least squares, 286, 498 
location invariant, 334 
maximum likelihood (see Maximum likeli- 
hood estimators) 
mean-squared error, 291 
method of moments, 274 
minimax, 299, 350 
minimum chi-square, 286, 287 
minimum distance, 286, 287 
Pitman, 290, 334, 336 
scale invariant, 336 
unbiased, 293, 315 

uniformly minimum-variance unbiased, 315 
Wilks’ generalized variance, 353 
(See also Large samples) 

Event, 14, 15, 18, 53
elementary, 15 
Event space, 15, 18, 23 
Excess, coefficient of, 76 
Expectation, 64, 69, 153, 160, 176 
Expected values, 64, 69, 70, 129, 153, 160 
conditional, 157 

of functions of random variables, 176 
properties of, 70 

Exponential class, 312, 313, 320, 326, 355, 422 
Exponential distribution, 111, 121, 237, 262, 540
Exponential family, 312, 313, 320, 326, 355, 422 
Extension theorem, 22 
Extreme-value statistic, 118, 258 
asymptotic distribution of, 261 


F distribution, 246, 542, 549 
table of, 554 

Factorial moment generating function, 79 
Factorial moments, 77, 79 
Factorial notations, 528 
Factorial symbol, 528 
Factorization criterion, 307 
Finite population, sampling from, 267 
Frequency function, 58 
Function, 19 
beta, 535 
convex, 72 

counterdomain of, 19, 53 


Function: 
decision, 297 
definition of, 19 
density (see Distributions) 
distance, 287 

distribution (see Distributions) 

domain of, 19, 53 

gamma, 534 

generating, 84 

image of, 19 

indicator, 20 

likelihood, 278 

loss, 297 

squared-error, 297 

moment generating (see Moment generating 
function) 
power, 406 
preimage, 19 
probability, 21, 22, 26 
regression, 158, 168 
risk, 297, 298 
set, 20, 21
size-of-set, 21 


Game of craps, 48 

Gamma distribution, 111, 112, 123, 540 
Gamma function, 534 
Gauss-Markov theorem, 500 
Gaussian distribution (see Normal distribution) 
Generalized likelihood ratio (see Likelihood 
ratio) 

Generalized variance, 352, 353 
Generating functions (see specific generating 
functions) 

Geometric distribution, 99, 538 
Geometric series, 528 
Glivenko-Cantelli theorem, 507
Goodness-of-fit test:
chi-square, 442, 447 
Kolmogorov-Smirnov, 508, 509 
Gumbel distribution, 118, 542


Homogeneity of populations, test of, 505 
two exponentials, 476 
two multinomials, 450 
two normals, 432, 435 
two Poissons, 451 
two trinomials, 479 

Homogeneity of variances, test of, 438, 439 
Hypergeometric distributions, 91, 538 
Hypotheses, statistical: alternative, 405 
composite, 402 
null, 405 
simple, 402, 409 
(See also Tests of hypotheses) 
Ideal power function, 405
Incomplete beta function, 115, 515, 516 
Incomplete gamma function, 114 
Independence, 32, 150, 160 
in contingency tables, 452 
of events, 40, 41, 46

in probability sense, 32, 143, 150, 160, 161 
of random variables, 150 
of sample mean and variance, 243, 245 
stochastic, 150 
Index set, 13 
Inference, 220, 271 
deductive, 220 
inductive, 220, 221 
Information, 300, 301 
Interquartile range, 75
Intersection, 10, 13
Interval estimation, 372 
Bayesian, 396
large sample, 393 
(See also Confidence intervals) 

Invariance:
location, 331-336 

of maximum likelihood estimators, 284, 285, 
442 

scale, 331, 336-338
Inverse binomial sampling, 103 


Jacobian, 205
Jensen inequality, 72 
Joint distributions, 129ff.
Joint moments, 159 


Kolmogorov-Smirnov goodness-of-fit test,
508-510 
Kurtosis, 76 


Lagrange multipliers, 501
Large sample, 294, 358 
confidence limits, 393 
distribution of estimators, 359
distribution of generalized likelihood ratio, 
440 

distribution of mean, 233 
Law of large numbers, 231, 232, 258
Least squares, 498 

Lehmann-Scheffé theorem, 326, 356
Leibniz rule, 532
L’Hospital’s rule, 532
Lmk problem, 47 
Likelihood function, 278 
induced, 285
Likelihood ratio:
exact distribution of, 480 


Likelihood ratio:
generalized, 419

large-sample distribution for, 440 
monotone, 423, 424 
simple, 410, 423 
tests, 409, 419, 440 

Limiting distribution, 196, 258, 261, 444, 446,
507-509 

Linear function of normal variates, distribution 
of, 194 

Linear models, 482, 485
confidence intervals, 491
point estimation, 487, 498
tests of hypotheses, 494 
Linear regression, 482 
Location invariance, 332-336
Location parameter, 333 
Logistic, 118, 540 
Lognormal distribution, 1 17, 540 
Loss function, 297, 343, 414, 415


Marginal distributions 
for bivariate normal distribution, 167 
continuous, 141 
discrete, 132, 135 
Mass, 53 
Mass point, 58 

Maximum of random variables, 182 
Maximum likelihood, principle of, 276 
Maximum likelihood estimators, 279 
invariance property, 284, 285
large sample distribution of, 358, 359 
of parameters of normal distribution, 281 
of uniform distribution, 282 
properties of, 284, 358
Mean:

definition of, 64
distribution of, 236-238
sample, 228, 230
variance of, 231 
Mean absolute deviation, 297 
Mean-squared error, 291, 297
Mean-squared-error consistency, 294, 295
Median, 73, 255
sample, 255 
tests of, 521 

Mendelian inheritance, 445 
Method of moments, 274 
Midrange, 255 

Minimal sufficient statistic, 311 
Minimax estimator, 299, 350
Minimax test, 416
Minimum of random variables, 182
Minimum chi-square estimation, 286, 287
Minimum distance estimation, 286, 287 

Minimum-variance unbiased estimator, 
uniformly (UMVUE), 315 
Mixture, 122, 123 
Mode, 74 
Model: 

functional, 483 
linear, 483, 485 

Moment generating function, 72, 78, 80, 159-161,
164
factorial, 79 

of random variables, 159, 160 
table of, 538-543 
Moment problem, 81 
Moments, 64, 72, 73 
central, 73 
cumulant, 80 
estimators of, 227 
factorial, 77 
joint, 159 
population, 227 
problem of, 81 
raw, 72, 73 
sample, 227 

Monotone likelihood ratio, 423, 424 
Multinomial distribution, 137, 253, 443
covariance of, 196 
tests on, 448 

Multinomial theorem, 530, 531 
Multiplication rule, 37 
Multivariate distributions, 129ff.
marginal and conditional distributions for,
132, 135, 141, 143-147
moment generating function for, 160 
moments of, 159 
Mutually exclusive, 3, 14, 41 


Negative binomial distribution, 99, 538
Negative hypergeometric distribution, 213 
Neyman-Pearson lemma, 411
Nonparametric methods, 504 
confidence intervals, 512 
equality of distributions, 518-524 
interval estimates, 512 
Kolmogorov-Smirnov statistic, 508-510 
median, 512 
median test, 521 
point estimation, 512 
quantiles, 512 
rank correlation, 525 
rank-sum test, 522 
run test, 519 

sign test: one-sample, 514 
two-sample, 519 
tests (see Tests of hypotheses) 
tolerance limits, 515 


Normal distribution, 107-111, 120, 239, 540, 548
bivariate, 162, 525 
conditional, 167 

independence of sample mean and variance, 
243, 245 
marginal, 167 

moment generating function for, 164 
multivariate, 162 
regression functions for, 168 
role of, 239
sample mean, 240 
sample variance, 241 
table of, 552 
truncated, 124 
Normal equations, 487 
Null hypothesis (see Hypotheses, statistical) 


Order statistics, 251 
asymptotic distribution of, 256-264 
distribution of functions of, 254 


Parameter, 85 
Parameter space, 273, 351 
Parametric family, 85 
Pareto distribution, 118, 542 
Partition, 300 

Pearson’s chi-square tests, 444, 459, 461 
Pitman-closer, 290 
Pitman estimator for location, 334 
Pitman estimator for scale, 337 
Pivotal quantity, 379 
Pivotal quantity method, 379, 387 
Poisson distribution, 93, 104, 119-121, 123, 236, 
538 

compound, 123 
Populations, 222, 224 
sampled, 223 
target, 222 

Posterior Bayes estimator, 341 
Posterior distribution, 340 
Posterior risk, 346 
Power function of test, 406, 411
ideal, 406 

Prior distribution, 340, 417 
Probability, 2 
a posteriori, 5, 9 
a priori, 2-4, 9 
axioms of, 8, 22
classical, 2, 3, 5 
conditional, 32, 42 
properties of, 34 
definition, 3, 21, 22
equally likely, 3, 5, 25 
frequency, 2, 5, 6 
function, 21, 22 
Probability:

independence, 32, 40-42
integral transformation, 202 
laws of, 23-25, 34-37 
mass function, 58 
models, 8 

properties of, 23-25, 34-37
space, 25, 53
subjective,? 
total, 35 

Probability density function, 57, 60, 62, 141
Probability function, 58 
equally likely, 26 

Probability generating function, 84 
Probability integral transform, 202 
Problem of moments, 81
Product of two random variables, 180
Product notation, 527
Propagation of errors, 181


Quantile, 73

point and interval estimates, 512 
tests of hypotheses, 514 
Quotient of two random variables, 180 


Random number, 107, 202
of random variables, 197
Random sampling, 223
Random variables, 53
continuous, 60, 138 
definition of, 53 
discrete, 57, 133 
joint, 133, 138 
maximum of, 182 
minimum of, 182 
mixed, 62 
product, 180 
quotient, 180 
sums, 178 

(See also Distributions) 

Range, 255 
of function, 19 
interquartile, 75
of sample, 255 
Rank-sum test, 522 
Rao-Blackwell theorem, 321, 354 
Rectangular distribution, 105, 238, 540
References, 544 
Regression, linear, 482 
Regression coefficient:
confidence interval, 491
estimators, 491
tests, 494 

Regression curve, 158
for normal, 168


Regression function (see Regression curve)
Relative frequency, 3, 6, 8
Reparameterize, 441, 442
Risk:

Bayes, 344 

function, 297, 298, 415
posterior, 346
Runs, 519


Sample, 222 

cumulative distribution function, 264 
distribution of, 224
mean, 227, 228, 230
median, 255
midrange, 255
moments, 226, 227
quantiles, 251
random, 223
range, 255 
variance, 229, 245 
Sample c.d.f., 264
inferences on, 50$

Sample mean, 219, 227, 228, 230, 240
variance of, 231
Sample moments, 219, 227 
Sample point, 9 
Sample space, 9,14 
finite, 15, 31, 25ff 
Sample variance, 229, 245 
Sampled populations, 223 
Sampling, 27, 219 
with replacement, 27 
without replacement, 27 
Sampling distributions, 219, 224 
for difference of two means, 386 
for mean from binomial population, 236
from Cauchy population, 238
from exponential, 237
of large samples, 232-235
from normal, 241
Poisson population, 236
from uniform population, 238
for order statistics, 251ff.
for ratio of sample variances, 247
for regression coefficients, 490 


Scale invariance, 336-338 
Scale parameter, 336 
Semi-invariants, 80

Sequential probability ratio test, 464, 466-467 
approximate, 468 
expected sample size of, 468, 470
Sequential tests, 464 
for binomial, 481 
fundamental identity for, 470
for mean of normal population, 471 
sample size in, 468
Set, 9 

complements, 10 
confidence, 461 
difference, 10 
disjoint, 14 
empty, 10 
equivalence of, 10 
functional 
index, 13 

intersection, 10, 13 
laws of, 11 

mutually exclusive, 14 
null, 10 
theory, 9ff. 
union, 10, 13 
Sigma algebra, 22 
Significance level, 407 
Significance testing, 407, 409 
Simple hypothesis (see Hypotheses, statistical) 
Singular continuous c.d.f., 63
Size of critical region, 407
Size of set, 29 
Size of test, 407 
Skewness, 75, 76 
Space, 9 

event, 1, 14, 15, 18, 23, 53
parameter, 273 
probability, 25 
sample, 1, 14, 31 

Spearman’s rank correlation, 525, 526 
Standard deviation, 68 
Statistic, 226 

Statistical hypotheses (see Hypotheses, 
statistical) 

Statistical inference, 271 
Statistical tests (see Tests of hypotheses) 
Stieltjes integral, 69
Stirling’s formula, 530 
Stochastic independence, 150 
Student’s t distribution, 249, 250, 542, 550
table of, 556 
Subset, 10 

Sufficient statistics, 299, 301, 306, 321, 391 
complete, 321, 324, 326 
factorization criterion, 307 
jointly, 306, 307
minimal, 311, 312, 326
tests of hypotheses, 408 
Summation notation, 527 
Sums of random variables: 
covariance of, 179 
distribution, 192 
variance of, 178 
Symmetrically distributed, 170 


t distribution (see Student’s t distribution)
Tables of distributions, 538-543 
chi-square, 553 
F, 554, 555
normal, 551, 552
Student’s t, 556
Target populations, 222 
Taylor series, 533 
Tests of hypotheses, 271, 401-403
Bayes, 417, 418 
chi-square, 440 
composite, 418 
and confidence intervals, 461 
critical function of, 404 
critical region of, 403 
distribution-free (see Nonparametric 
methods) 

equality-of-means, 432, 435
equality of two distributions, 518 
equality of two multinomials, 448 
goodness-of-fit, 442, 447 
homogeneity (see Homogeneity of 
populations) 

homogeneity of variances, 438, 439 
independence in contingency tables, 452 
large-sample, 440 
likelihood-ratio: generalized, 419 
simple, 410, 419 

mean of normal population, 428-431

median, 521 

minimax, 416 

most powerful, 410, 411

nonrandomized, 403 

null hypothesis, 405 

power of, 406 

randomized, 403, 404 

rank-sum, 522

ratio of variances, 438 

relation to confidence intervals, 461 

run, 519 

sequential (see Sequential tests) 
of significance, 407 
simple, 409 
size of, 407 

sufficient statistics, 408 
unbiased, 425 

uniformly most powerful, 421 
on variances, 431, 432, 438 
Tests of significance, 407 
Ticktacktoe problem, 62 
Tolerance limits, 505, 515, 516 
Total probabilities, theorem of, 35, 148, 149 
Transformations, 175, 198, 202 
c.d.f. technique, 181 
m.g.f. technique, 189
probability integral, 202, 203 
Treatment effect, 437, 519 
Truncated distribution, 101, 104, 122, 122
normal, 124
Poisson, 104 
Type I and II errors, 403
size of, 405 


UMVUE (uniformly minimum variance 
unbiased estimator), 315 
Unbiased estimator(s), 293, 315, 352
best linear, 499 
joint, 352 

uniformly minimum variance, 315
Unbiased test, 425 

Uncorrelated random variables, 161, 173, 174
Uniform distribution, 105, 238, 540
Uniformly most accurate, 464 
Uniformly most powerful test, 421 
composite, 421 
simple, 411 
unbiased, 425 
Union of sets, 10, 13 


Variance, 67 
analysis of, 437 
conditional, 159 
definition of, 67 
distribution of sample, 241
estimator of, 229 

of linear combination of random variables, 
178, 179

lower bound for, 315 
sample, 229, 245 
of sample mean, 23 1 
of sum of random variables, 178 
tests of, 431, 438

Variance-covariance matrix, 352, 489
Variance ratio, 246, 437 
Vector parameters, 351 
Venn diagrams, 11-13 


Waiting time, 101, 103
Wald’s equation, 470
Weibull distribution, 117, 542 
Wilks' generalized variance, 353 



